**Leen Lambers · Sebastián Uchitel (Eds.)**

# **Fundamental Approaches to Software Engineering**

**26th International Conference, FASE 2023 Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2023 Paris, France, April 22–27, 2023 Proceedings**

# Lecture Notes in Computer Science 13991

Founding Editors

Gerhard Goos, Germany
Juris Hartmanis, USA

# Editorial Board Members

Elisa Bertino, USA
Wen Gao, China
Bernhard Steffen, Germany
Moti Yung, USA

# Advanced Research in Computing and Software Science Subline of Lecture Notes in Computer Science

Subline Series Editors

Giorgio Ausiello, University of Rome 'La Sapienza', Italy
Vladimiro Sassone, University of Southampton, UK

Subline Advisory Board

Susanne Albers, TU Munich, Germany
Benjamin C. Pierce, University of Pennsylvania, USA
Bernhard Steffen, University of Dortmund, Germany
Deng Xiaotie, Peking University, Beijing, China
Jeannette M. Wing, Microsoft Research, Redmond, WA, USA

More information about this series at https://link.springer.com/bookseries/558


Editors

Leen Lambers
Brandenburg University of Technology Cottbus-Senftenberg, Cottbus, Germany

Sebastián Uchitel
CONICET/University of Buenos Aires, Buenos Aires, Argentina
Imperial College London, London, UK

ISSN 0302-9743 ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-031-30825-3 ISBN 978-3-031-30826-0 (eBook)
https://doi.org/10.1007/978-3-031-30826-0

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

# ETAPS Foreword

Welcome to the 26th ETAPS! ETAPS 2023 took place in Paris, the beautiful capital of France. ETAPS 2023 was the 26th instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference established in 1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each conference has its own Program Committee (PC) and its own Steering Committee (SC). The conferences cover various aspects of software systems, ranging from theoretical computer science to foundations of programming languages, analysis tools, and formal approaches to software engineering. Organising these conferences in a coherent, highly synchronized conference programme enables researchers to participate in an exciting event, having the possibility to meet many colleagues working in different directions in the field, and to easily attend talks of different conferences. On the weekend before the main conference, numerous satellite workshops took place that attracted many researchers from all over the globe.

ETAPS 2023 received 361 submissions in total, 124 of which were accepted, yielding an overall acceptance rate of 34.3%. I thank all the authors for their interest in ETAPS, all the reviewers for their reviewing efforts, the PC members for their contributions, and in particular the PC (co-)chairs for their hard work in running this entire intensive process. Last but not least, my congratulations to all authors of the accepted papers!

ETAPS 2023 featured the unifying invited speakers Véronique Cortier (CNRS, LORIA laboratory, France) and Thomas A. Henzinger (Institute of Science and Technology, Austria) and the conference-specific invited speakers Mooly Sagiv (Tel Aviv University, Israel) for ESOP and Sven Apel (Saarland University, Germany) for FASE. Invited tutorials were provided by Ana-Lucia Varbanescu (University of Twente and University of Amsterdam, The Netherlands) on heterogeneous computing and Joost-Pieter Katoen (RWTH Aachen, Germany and University of Twente, The Netherlands) on probabilistic programming.

As part of the programme we had the second edition of TOOLympics, an event to celebrate the achievements of the various competitions or comparative evaluations in the field of ETAPS.

ETAPS 2023 was organized jointly by Sorbonne Université and Université Sorbonne Paris Nord. Sorbonne Université (SU) is a multidisciplinary, research-intensive and world-class academic institution. It was created in 2018 as the merger of two first-class research-intensive universities, UPMC (Université Pierre et Marie Curie) and Paris-Sorbonne. SU has three faculties: humanities, medicine, and science and engineering. It has 55,600 students (4,700 PhD students; 10,200 international students), 6,400 teachers and professor-researchers, and 3,600 administrative and technical staff members. Université Sorbonne Paris Nord is one of the thirteen universities that succeeded the University of Paris in 1968. It is a major teaching and research center located in the north of Paris. It has five campuses, spread over the two departments of Seine-Saint-Denis and Val d'Oise: Villetaneuse, Bobigny, Saint-Denis, the Plaine Saint-Denis and Argenteuil. The university has more than 25,000 students in different fields, such as health, medicine, languages, humanities, and science. The local organization team consisted of Fabrice Kordon (general co-chair), Laure Petrucci (general co-chair), Benedikt Bollig (workshops), Stefan Haar (workshops), Étienne André (proceedings and tutorials), Céline Ghibaudo (sponsoring), Denis Poitrenaud (web), Stefan Schwoon (web), Benoît Barbot (publicity), Nathalie Sznajder (publicity), Anne-Marie Reytier (communication), Hélène Pétridis (finance) and Véronique Criart (finance).

ETAPS 2023 is further supported by the following associations and societies: ETAPS e.V., EATCS (European Association for Theoretical Computer Science), EAPLS (European Association for Programming Languages and Systems), EASST (European Association of Software Science and Technology), Lip6 (Laboratoire d'Informatique de Paris 6), LIPN (Laboratoire d'informatique de Paris Nord), Sorbonne Université, Université Sorbonne Paris Nord, CNRS (Centre national de la recherche scientifique), CEA (Commissariat à l'énergie atomique et aux énergies alternatives), LMF (Laboratoire méthodes formelles), and Inria (Institut national de recherche en informatique et en automatique).

The ETAPS Steering Committee consists of an Executive Board, and representatives of the individual ETAPS conferences, as well as representatives of EATCS, EAPLS, and EASST. The Executive Board consists of Holger Hermanns (Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König (Duisburg), Thomas Noll (Aachen), Caterina Urban (Inria), Jan Křetínský (Munich), and Lenore Zuck (Chicago).

Other members of the steering committee are: Dirk Beyer (Munich), Luís Caires (Lisboa), Ana Cavalcanti (York), Bernd Finkbeiner (Saarland), Reiko Heckel (Leicester), Joost-Pieter Katoen (Aachen and Twente), Naoki Kobayashi (Tokyo), Fabrice Kordon (Paris), Laura Kovács (Vienna), Orna Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick), Andrzej Murawski (Oxford), Laure Petrucci (Paris), Elizabeth Polgreen (Edinburgh), Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella (Edinburgh), Natasha Sharygina (Lugano), Pawel Sobocinski (Tallinn), Sebastián Uchitel (London and Buenos Aires), Andrzej Wasowski (Copenhagen), Stephanie Weirich (Pennsylvania), Thomas Wies (New York), Anton Wijs (Eindhoven), and James Worrell (Oxford).

I would like to take this opportunity to thank all authors, keynote speakers, attendees, organizers of the satellite workshops, and Springer-Verlag GmbH for their support. I hope you all enjoyed ETAPS 2023.

Finally, a big thanks to Laure and Fabrice and their local organization team for all their enormous efforts to make ETAPS a fantastic event.

April 2023 Marieke Huisman ETAPS SC Chair ETAPS e.V. President

# Preface

This book contains the proceedings of FASE 2023, the 26th International Conference on Fundamental Approaches to Software Engineering, held in Paris, France, in April 2023, as part of the annual European Joint Conferences on Theory and Practice of Software (ETAPS 2023).

FASE is concerned with the foundations on which software engineering is built. We solicited four categories of papers: research, empirical, new ideas and emerging results, and tool demonstrations, all of which should make novel contributions to making software engineering a more mature and soundly based discipline.

The contributions accepted for presentation at the conference were carefully selected by means of a thorough double-blind review process that included no fewer than three reviews per paper. We received 50 submissions, which, after a reviewing period of nine weeks and intensive discussion, resulted in 16 accepted papers, representing a 32% acceptance rate.

We also ran an artifact track in which authors of accepted papers could optionally submit artifacts described in their papers for evaluation. Ten artifacts were submitted, eight of which were successfully evaluated.

In addition, FASE 2023 hosted the 5th International Competition on Software Testing (Test-Comp 2023), which is an annual comparative evaluation of automatic tools for test generation. A total of 13 tools participated this year, from seven countries. The tools were developed in academia and in industry. The submitted tools and the submitted system-description papers were reviewed by a separate program committee: the Test-Comp jury. Each tool and paper was assessed by at least three reviewers. These proceedings contain the competition report and one selected system description of a participating tool. Two sessions in the FASE program were reserved for the presentation of the results: the summary by the Test-Comp chair and of the participating tools by the developer teams in the first session, and the community meeting in the second session.

We thank the ETAPS 2023 general chair, Marieke Huisman, the ETAPS 2023 organizers, Fabrice Kordon and Laure Petrucci, as well as the FASE SC chair, Andrzej Wasowski, for their support during the whole process. We thank our invited speaker, Sven Apel, for his keynote. We thank all the authors for their hard work and willingness to contribute. We thank all the Program Committee members and external reviewers, who invested time and effort in the selection process to ensure the scientific quality of the program. Last but not least, we thank the Test-Comp chair, Dirk Beyer,

the artifact evaluation committee chairs, Marie-Christine Jakobs and Carlos Diego Nascimento Damasceno, and their evaluation committees.

April 2023 Leen Lambers Sebastián Uchitel

# Organization

# FASE—Program Committee Chairs


# FASE—Steering Committee Chair


# FASE—Steering Committee


# FASE—Program Committee



# FASE—Artifact Evaluation Committee Chairs


# FASE—Artifact Evaluation Committee


# Test-Comp—Program Committee and Jury



# FASE—Additional Reviewers

Babikian, Aren
Baranov, Eduard
Barnett, Will
Baxter, James
Bubel, Richard
Chen, Boqi
d'Aloisio, Giordano
Damasceno, Carlos Diego Nascimento
David, Istvan
De Boer, Frank
Din, Crystal Chang
Faqrizal, Irman
Feng, Nick
Hu, Caroline
Jongmans, Sung-Shik
Kamburjan, Eduard

Kobialka, Paul
Lang, Frédéric
Lazreg, Sami
Marsso, Lina
Marussy, Kristóf
Marzi, Francesca
Metongnon, Lionel
Pun, Violet Ka I
Raz, Orna
Schivo, Stefano
Schlatte, Rudolf
Soueidi, Chukri
Ye, Kangfeng
Zavattaro, Gianluigi
Ziv, Avi

# Brains on Code: Towards a Neuroscientific Foundation of Program Comprehension (Abstract of an Invited Talk)

### Sven Apel

Saarland University, Saarland Informatics Campus

Abstract. Research on program comprehension has a fundamental limitation: program comprehension is a cognitive process that cannot be directly observed, which leaves considerable room for misinterpretation, uncertainty, and confounders. In the project Brains On Code, we are developing a neuroscientific foundation of program comprehension. Instead of merely observing whether there is a difference regarding program comprehension (e.g., between two programming methods), we aim at precisely and reliably determining the key factors that cause the difference. This is especially challenging as humans are the subjects of study, and inter-personal variance and other confounding factors obfuscate the results. The key idea of Brains On Code is to leverage established methods from cognitive neuroscience to obtain insights into the underlying processes and influential factors of program comprehension.

Brains On Code pursues a multimodal approach that integrates different neuro-physiological measures as well as a cognitive computational modeling approach to establish the theoretical foundation. This way, Brains On Code lays the foundations of measuring and modeling program comprehension and offers substantial feedback for programming methodology, language design, and education. With Brains On Code, addressing longstanding foundational questions such as "How can we reliably measure program comprehension?", "What makes a program hard to understand?", and "What skills should programmers have?" comes into reach. Brains On Code not only helps answer these questions, but also provides an outline for applying the methodology beyond program code (models, specifications, requirements, etc.).

Keywords: Program comprehension · Neuro-imaging · Computational cognitive modeling

# Contents

### Regular Contributions




FuSeBMC\_IA: Interval Analysis and Methods for Test Case Generation (Competition Contribution) . . . . . 324
Mohannad Aldughaim, Kaled M. Alshmrany, Mikhail R. Gadelha, Rosiane de Freitas, and Lucas C. Cordeiro


# **Regular Contributions**

# ACoRe: Automated Goal-Conflict Resolution

Luiz Carvalho<sup>1</sup>, Renzo Degiovanni<sup>1</sup>, Matías Brizzio<sup>2,3</sup>, Maxime Cordy<sup>1</sup>, Nazareno Aguirre<sup>4</sup>, Yves Le Traon<sup>1</sup>, Mike Papadakis<sup>1</sup>

<sup>1</sup> SnT, University of Luxembourg, Luxembourg City, Luxembourg
{luiz.carvalho,renzo.degiovanni,maxime.cordy,yves.traon,mike.papadakis}@uni.lu
<sup>2</sup> IMDEA Software Institute, Madrid, Spain

matias.brizzio@imdea.org

<sup>3</sup> Universidad Politécnica de Madrid, Madrid, Spain

<sup>4</sup> Universidad Nacional de Río Cuarto and CONICET, Río Cuarto, Argentina
naguirre@dc.exa.unrc.edu.ar

Abstract. System goals are the statements that, in the context of software requirements specification, capture how the software should behave. Many times, the understanding of stakeholders on what the system should do, as captured in the goals, can lead to different problems, from clearly contradicting goals, to more subtle situations in which the satisfaction of some goals inhibits the satisfaction of others. These latter issues, called goal divergences, are the subject of goal conflict analysis, which consists of identifying, assessing, and resolving divergences, as part of a more general activity known as goal refinement.

While there exist techniques that, when requirements are expressed formally, can automatically identify and assess goal conflicts, there is currently no automated approach to support engineers in resolving identified divergences. In this paper, we present ACoRe, the first approach that automatically proposes potential resolutions to goal conflicts, in requirements specifications formally captured using linear-time temporal logic. ACoRe systematically explores syntactic modifications of the conflicting specifications, aiming at obtaining resolutions that disable previously identified conflicts, while preserving specification consistency. ACoRe integrates modern multi-objective search algorithms (in particular, NSGA-III, WBGA, and AMOSA) to produce resolutions that maintain coherence with the original conflicting specification, by searching for specifications that are either syntactically or semantically similar to the original specification.

We assess ACoRe on 25 requirements specifications taken from the literature. We show that ACoRe can successfully produce various conflict resolutions for each of the analyzed case studies, including resolutions that resemble specification repairs manually provided as part of conflict analyses.

### 1 Introduction

Many software defects that come out during software development originate from incorrect understandings of what the software being developed should do [24]. These kinds of defects are known to be among the most costly to fix, and thus it is widely acknowledged that software development methodologies must involve phases that deal with the elicitation, understanding, and precise specification of software requirements. Among the various approaches to systematize this requirements phase, the so-called goal-oriented requirements engineering (GORE) methodologies [13,55] provide techniques that organize the modeling and analysis of software requirements around the notion of system goal. Goals are prescriptive statements that capture how the software to be developed should behave, and in GORE methodologies are subject to various activities, including goal decomposition, refinement, and the assignment of goals [3,13,15,39,55,56].

The characterization of requirements as formally specified system goals enables tasks that can reveal flaws in the requirements. Formally specified goals allow for the analysis and identification of goal divergences, situations in which the satisfaction of some goals inhibits the satisfaction of others [9,16]. These divergences arise as a consequence of goal conflicts. A conflict is a condition whose satisfaction makes the goals inconsistent. Conflicts are dealt with through goalconflict analysis [58], which comprises three main stages: (i) the identification stage, which involves the identification of conflicts between goals; (ii) the assessment stage, aiming at evaluating and prioritizing the identified conflicts according to their likelihood and severity; and (iii), the resolution stage, where conflicts are resolved by providing appropriate countermeasures and, consequently, transforming the goal model, guided by the criticality level.

Goal conflict analysis has been the subject of different automated techniques to assist engineers, especially in the conflict identification and assessment phases [16,18,43,56]. However, no automated technique has been proposed for dealing with goal conflict resolution. In this paper, we present ACoRe, the first automated approach that deals with the goal-conflict resolution stage. ACoRe takes as input a set of goals formally expressed in Linear-Time Temporal Logic (LTL) [45], together with previously identified conflicts, also given as LTL formulas. It then searches for candidate resolutions, i.e., syntactic modifications to the goals that remain consistent with each other, while disabling the identified conflicts. More precisely, ACoRe employs modern search-based algorithms to efficiently explore syntactic variants of the goals, guided by a syntactic and semantic similarity with the original goals, as well as with the inhibition of the identified conflicts. This search guidance is implemented as (multi-objective) fitness functions, using Levenshtein edit distance [42] for syntactic similarity, and approximated LTL model counting [8] for semantic similarity. ACoRe exploits this fitness function to search for candidate resolutions, using various alternative search algorithms, namely a Weight-Based Genetic Algorithm (WBGA) [29], a Non-dominated Sorted Genetic Algorithm (NSGA-III) [14], an Archived Multi-Objective Simulated Annealing search (AMOSA) [6], and an unguided search approach, mainly used as a baseline in our experimental evaluations.

Our experimental evaluation considers 25 requirements specifications taken from the literature, for which goal conflicts are automatically computed [16]. The results show that ACoRe is able to successfully produce various conflict resolutions for each of the analyzed case studies, including resolutions that resemble specification repairs manually provided as part of conflict analyses. In this assessment, we measured the similarity of the produced resolutions with respect to the ground truth, i.e., to the manually written repairs, when available. The genetic algorithms are able to resemble 3 out of 8 repairs in the ground truth. Moreover, the results show that ACoRe generates more non-dominated resolutions (resolutions whose fitnesses are not dominated by those of other repairs in the output set) when adopting genetic algorithms (NSGA-III or WBGA) than with AMOSA or unguided search, favoring genetic multi-objective search over the other approaches.
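Non-dominance here has the standard multi-objective meaning: a resolution is kept only if no other candidate is at least as good on every fitness objective and strictly better on some. A small self-contained sketch of this filter (illustrative only, not ACoRe's implementation; the fitness vectors below are made up, with both objectives maximised):

```python
def dominates(a, b):
    """a dominates b if a >= b on every objective and > on at least one
    (objectives maximised, e.g. syntactic and semantic similarity)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(candidates):
    """Keep the candidates whose fitness vector no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# Four hypothetical (syntactic, semantic) fitness vectors:
fits = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8)]
front = non_dominated(fits)   # (0.4, 0.4) is dominated by (0.5, 0.5)
```

The surviving set is the Pareto front reported to the engineer; comparing the sizes of these fronts across NSGA-III, WBGA, AMOSA, and unguided search is the comparison made above.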

### 2 Linear-Time Temporal Logic

### 2.1 Language Formalism

Linear-Time Temporal Logic (LTL) is a logical formalism widely used to specify reactive systems [45]. In addition, GORE methodologies (e.g. KAOS) have also adopted LTL to formally express requirements [55] and taken advantage of the powerful automatic analysis techniques associated with LTL to improve the quality of their specifications (e.g., to identify inconsistencies [17]).

Definition 1 (LTL Syntax). Let AP be a set of propositional variables. LTL formulas are inductively defined using the standard logical connectives and the temporal operators ◯ (next) and U (until), as follows:

$$\varphi ::= p \mid \neg \varphi \mid \varphi \lor \varphi \mid \bigcirc \varphi \mid \varphi \,\mathcal{U}\, \varphi, \qquad \text{where } p \in AP$$

LTL formulas are interpreted over infinite traces of the form $\sigma = s_0 s_1 \ldots$, where each $s_i$ is a propositional valuation in $2^{AP}$ (i.e., $\sigma \in (2^{AP})^{\omega}$).

Definition 2 (LTL Semantics). We say that a trace $\sigma = s_0 s_1 \ldots$ satisfies a formula $\varphi$, written $\sigma \models \varphi$, if and only if $\varphi$ holds at the initial state of the trace, i.e., $(\sigma, 0) \models \varphi$. The latter notion is inductively defined on the shape of $\varphi$ as follows:

$$\begin{array}{l} (a)\ (\sigma, i) \models p \Leftrightarrow p \in s_i \\ (b)\ (\sigma, i) \models (\phi \lor \psi) \Leftrightarrow (\sigma, i) \models \phi \text{ or } (\sigma, i) \models \psi \\ (c)\ (\sigma, i) \models \neg \phi \Leftrightarrow (\sigma, i) \not\models \phi \\ (d)\ (\sigma, i) \models \bigcirc \phi \Leftrightarrow (\sigma, i+1) \models \phi \\ (e)\ (\sigma, i) \models (\phi \,\mathcal{U}\, \psi) \Leftrightarrow \exists k \geq i : (\sigma, k) \models \psi \text{ and } \forall i \leq j < k : (\sigma, j) \models \phi \end{array}$$

Intuitively, formulas with no temporal operator are evaluated in the first state of the trace. Formula $\bigcirc \varphi$ is true at position $i$ iff $\varphi$ is true at position $i+1$. Formula $\phi \,\mathcal{U}\, \psi$ is true in $\sigma$ iff $\psi$ eventually holds and $\phi$ holds at every position before that.

Definition 3 (Satisfiability). An LTL formula $\varphi$ is said to be satisfiable (SAT) iff there exists at least one trace satisfying $\varphi$.

We also consider the other typical connectives and operators, such as ∧, ✷ (always), ✸ (eventually), and W (weak-until), which are defined in terms of the basic ones: φ ∧ ψ ≡ ¬(¬φ ∨ ¬ψ), ✸φ ≡ true U φ, ✷φ ≡ ¬✸¬φ, and φ W ψ ≡ (✷φ) ∨ (φ U ψ).

### 2.2 Model Counting

The model counting problem consists of calculating the number of models that satisfy a formula. Since the models of LTL formulas are infinite traces, the analysis is often restricted to a class of canonical finite representations of infinite traces, such as lasso traces or tree models; this is notably the case in bounded model checking [7].

Definition 4 (Lasso Trace). A lasso trace $\sigma$ is of the form $\sigma = s_0 \ldots s_i (s_{i+1} \ldots s_k)^{\omega}$, where the states $s_0 \ldots s_k$ form the base of the trace, and the loop from state $s_k$ back to state $s_{i+1}$ is the part of the trace that is repeated infinitely many times.

For example, the LTL formula ✷(p ∨ q) is satisfiable, and one satisfying lasso trace is $\sigma_1 = \{p\}; (\{p, q\})^{\omega}$, in which $p$ holds in the first state and, from the second state on, both $p$ and $q$ hold forever. Notice that the base of the lasso trace $\sigma_1$ is the sequence containing both states $\{p\}; \{p, q\}$, while the state $\{p, q\}$ forms the loop part.
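Restricted to lasso traces, the semantics of Definition 2 can be checked mechanically, because the suffixes of a lasso repeat periodically once inside the loop. The following Python sketch is our own illustrative encoding (the tuple representation and helper names are not from the paper):

```python
# A lasso trace s0 ... si (s(i+1) ... sk)^w, represented by its base states
# (each a set of true propositions) and the index the trace loops back to.
class Lasso:
    def __init__(self, states, loop):
        self.states, self.loop = states, loop

    def state(self, i):
        n = len(self.states)
        if i < n:
            return self.states[i]
        return self.states[self.loop + (i - self.loop) % (n - self.loop)]

# Formulas as nested tuples over the core connectives of Definition 1.
def ap(x):     return ('ap', x)
def neg(f):    return ('not', f)
def lor(f, g): return ('or', f, g)
def X(f):      return ('next', f)       # next
def U(f, g):   return ('until', f, g)   # until

# Derived operators, exactly as defined in the text.
TRUE = lor(ap('_'), neg(ap('_')))       # p v ~p, a tautology
def F(f): return U(TRUE, f)             # eventually: true U f
def G(f): return neg(F(neg(f)))         # always: not eventually not f

def holds(phi, sigma, i=0):
    """Definition 2 on a lasso trace. The scan bound for 'until' is exact:
    suffixes repeat with period p = |base| - loop once inside the loop, so
    positions beyond max(i, |base|) + p add no new suffixes."""
    op = phi[0]
    if op == 'ap':   return phi[1] in sigma.state(i)
    if op == 'not':  return not holds(phi[1], sigma, i)
    if op == 'or':   return holds(phi[1], sigma, i) or holds(phi[2], sigma, i)
    if op == 'next': return holds(phi[1], sigma, i + 1)
    if op == 'until':
        n = len(sigma.states)
        p = n - sigma.loop
        for k in range(i, max(i, n) + p):
            if holds(phi[2], sigma, k):       # psi reached: until satisfied
                return True
            if not holds(phi[1], sigma, k):   # phi broken before psi: fails
                return False
        return False                           # psi never holds on this lasso
    raise ValueError(f'unknown operator: {op}')

# The trace sigma1 = {p}; ({p,q})^w from the example satisfies [](p v q):
sigma1 = Lasso([{'p'}, {'p', 'q'}], loop=1)
satisfies = holds(G(lor(ap('p'), ap('q'))), sigma1)  # True for this trace
```

The bound in the `until` case is what makes the check terminate: once inside the loop the trace keeps revisiting the same suffixes with period `p`, so an eventuality not reached within one full period past the base is never reached.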

Definition 5 (LTL Model Counting). Given an LTL formula ϕ and a bound k, the (bounded) model counting problem consists in computing how many lasso traces of at most k states exist for ϕ. We denote this as #(ϕ, k).

Since existing approaches for computing the exact number of lasso traces are ineffective [25], Brizzio et al. [8] recently developed a novel model-counting approach that approximates the number (of prefixes) of lasso traces satisfying an LTL formula. Intuitively, instead of counting the number of lasso traces of length k, the approach of Brizzio et al. [8] aims at approximating the number of bases of length k corresponding to some satisfying lasso trace.

Definition 6 (Approximate LTL Model Counting). Given an LTL formula $\varphi$ and a bound $k$, the approach of Brizzio et al. [8] approximates the number of bases $w = s_0 \ldots s_k$ such that, for some $i$, the lasso trace $\sigma = s_0 \ldots (s_i \ldots s_k)^{\omega}$ satisfies $\varphi$ (notice that the prefix $w$ is the base of $\sigma$). We denote by $\#Approx(\varphi, k)$ the number computed by this approximation.

ACoRe uses #Approx model counting to compute the semantic similarity between the original specification and the candidate goal-conflict resolutions.

### 3 The Goal-Conflict Resolution Problem

Goal-Oriented Requirements Engineering (GORE) [55] drives the requirements process in software development from the definition of high-level goals that state how the system to be developed should behave. Particularly, goals are prescriptive statements that the system should achieve within a given domain. The domain properties are descriptive statements that capture the domain of the problem world. Typically, GORE methodologies use a logical formalism to specify the expected system behavior, e.g., KAOS uses Linear-Time Temporal Logic for specifying requirements [55]. In this context, a conflict essentially represents a condition whose occurrence results in the loss of satisfaction of the goals, i.e., that makes the goals diverge [56,57]. Formally, it can be defined as follows.

Definition 7 (Goal Conflicts). Let $G = \{G_1, \ldots, G_n\}$ be a set of goals, and Dom a set of domain properties, all written in LTL. The goals in G are said to diverge if and only if there exists at least one boundary condition (BC) such that the following conditions hold:

$$\begin{array}{ll} \{Dom, BC, \bigwedge_{1 \leq i \leq n} G_i\} \models \mathit{false} & \textit{(logical inconsistency)} \\ \{Dom, BC, \bigwedge_{j \neq i} G_j\} \not\models \mathit{false}, \text{ for each } 1 \leq i \leq n & \textit{(minimality)} \\ BC \neq \neg(G_1 \land \ldots \land G_n) & \textit{(non-triviality)} \end{array}$$

Intuitively, a BC captures a particular combination of circumstances in which the goals cannot be satisfied. The first condition establishes that, when BC holds, the conjunction of goals $\{G_1, \ldots, G_n\}$ becomes inconsistent. The second condition states that, if any one of the goals is disregarded, consistency is recovered. The third condition prohibits a boundary condition from being simply the negation of the goals. Note also that the minimality condition prevents BC from being equivalent to false (BC has to be consistent with the domain Dom).

Goal-conflict analysis [55,56] deals with these issues through three main stages: (1) the goal-conflict identification stage consists in generating boundary conditions that characterize divergences in the specification; (2) the assessment stage consists in assessing and prioritizing the identified conflicts according to their likelihood and severity; (3) the resolution stage consists in resolving the identified conflicts by providing appropriate countermeasures. Let us consider the following examples, found in our empirical evaluation and commonly presented in related works.

Example 1 (Mine Pump Controller - MPC). Consider the Mine Pump Controller (MPC), widely used in related works that deal with formal requirements and reactive systems [16,35]. The MPC describes a system that is in charge of activating or deactivating a pump (p) to remove the water from the mine in the presence of potentially dangerous scenarios. The controller monitors environmental magnitudes related to the presence of methane (m) and a high level of water (h) in the mine. Maintaining a high level of water for a while may produce flooding in the mine, while the methane may cause an explosion when the pump is switched on. Hence, the specification for the MPC is as follows:

$$Dom: \Box((p \land \bigcirc(p)) \to \bigcirc(\bigcirc(\neg h))) \quad G\_1: \Box(m \to \bigcirc(\neg p)) \quad G\_2: \Box(h \to \bigcirc(p))$$

Domain property Dom describes the impact on the environment of switching on the pump (p): when the pump is kept on for two consecutive time units, the water level decreases and is no longer high (¬h). Goal $G_1$ expresses that the pump should be off when methane is detected in the mine. Goal $G_2$ indicates that the pump should be on when the water level is high.

Notice that this specification is consistent, for instance, in cases in which the level of water never exceeds the high threshold. However, approaches for goal-conflict identification, such as the one of Degiovanni et al. [16], can detect a conflict between goals in this specification.

The identified goal-conflict describes a divergence situation arising when the level of water is high and methane is present in the environment at the same time. Switching off the pump to satisfy $G_1$ results in a violation of goal $G_2$, while switching on the pump to satisfy $G_2$ violates $G_1$. This divergence clearly evidences a conflict between goals $G_1$ and $G_2$, which is captured by a boundary condition such as BC = ✸(h ∧ m).
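The divergence admits a one-step reading: whenever h and m hold simultaneously, $G_1$ obliges ¬p at the next instant while $G_2$ obliges p, so no value of the pump satisfies both. A minimal sketch of this case analysis (our own encoding, purely illustrative):

```python
def one_step_ok(h, m, p_next):
    """One-step obligations of G1 = [](m -> X ~p) and G2 = [](h -> X p):
    given the current readings h and m, is the next pump value p_next legal?"""
    g1 = (not m) or (not p_next)   # methane now forces the pump off next
    g2 = (not h) or p_next         # high water now forces the pump on next
    return g1 and g2

# With methane and high water together, neither pump choice is acceptable:
divergent = not any(one_step_ok(True, True, pn) for pn in (False, True))
```

In every other combination of readings at least one choice of `p_next` works, which is why the specification stays consistent as long as the situation h ∧ m is never reached.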

In the work of Letier et al. [40], two resolutions were manually proposed that precisely describe what the software behaviour should be when the divergence situation is reached. The first resolution refines goal $G_2$ by weakening it, requiring the pump to be switched on only when the level of water is high and no methane is present in the environment.

Example 2 (Resolution 1 - MPC).

$$\begin{aligned} Dom &: \Box((p \land \bigcirc(p)) \to \bigcirc(\bigcirc(\neg h))) \\ G\_1 &: \Box(m \to \bigcirc(\neg p)) \quad G\_2' &: \Box(h \land \neg m \to \bigcirc(p)) \end{aligned}$$

With a similar analysis, the second resolution weakens $G_1$, requiring the pump to be switched off when methane is present and the level of water is not high.

Example 3 (Resolution 2 - MPC).

$$\begin{aligned} \text{Dom} &: \Box((p \land \bigcirc(p)) \to \bigcirc(\bigcirc(\neg h))) \\ G\_1' &: \Box(m \land \neg h \to \bigcirc(\neg p)) \quad G\_2 : \Box(h \to \bigcirc(p)) \end{aligned}$$

The resolution stage aims at removing the identified goal-conflicts from the specification, for which it is necessary to modify the current specification formulation. This may require weakening or strengthening the existing goals, or even removing some and adding new ones.

Definition 8 (Goal-Conflict Resolution). Let G = {G1, . . . , Gn}, Dom, and BC be the set of goals, the domain properties, and an identified boundary condition, respectively, all written in LTL. Let M : S1 × S2 ↦ [0, 1] be a similarity metric between two specifications and ε ∈ [0, 1] a threshold. We say that a resolution R = {R1, . . . , Rm} resolves goal-conflict BC if and only if the following conditions hold:

1. (consistency) Dom ∧ R is satisfiable;
2. (resolution) BC ∧ R is satisfiable;
3. (similarity) M(G, R) ≥ ε.

Intuitively, the first condition states that the refined goals in R remain consistent with the domain properties Dom. The second condition states that BC does not lead to a divergence situation in the resolution R (i.e., the refined goals in R know exactly how to deal with the situations captured by BC). Finally, the last condition uses the similarity metric M to control the degree of change applied to the original formulation of the goals in G to produce the refined goals in resolution R.

Notice that the similarity metric M is general enough to capture similarities between G and R of different natures. For instance, M(G, R) may compute the syntactic similarity between the text representations of the original goals in G and the candidate resolution R, where minimising the number of tokens edited from G to R is the aim. On the other hand, M(G, R) may compute a semantic similarity between G and R, for instance, to favour resolutions that weaken the goals (i.e. G → R), strengthen the goals (i.e. R → G), or maintain most of the original behaviours (i.e. #G − #R < ε).

Precisely, ACoRe explores syntactic modifications of goals from G, leading to newly refined goals in R, with the aim of producing candidate resolutions that are consistent with the domain properties Dom and resolve conflict BC. Assuming that the engineer is competent and the current specification is very close to the intended one [19,1], ACoRe integrates two similarity metrics in a multi-objective search process to produce resolutions that are syntactically and semantically similar to the original specification. In particular, for the MPC discussed above, ACoRe can generate exactly the same resolutions that were manually developed by Letier et al. [40].

### 4 ACoRe: Automated Goal-Conflict Resolution

ACoRe takes as input a specification S = (Dom, G), composed of the domain properties Dom and a set of goals G, together with a set {BC1, . . . , BCk} of identified boundary conditions for S. ACoRe uses search to iteratively explore variants of G and produce a set R = {R1, . . . , Rn} of resolutions, where each Ri = (Dom, Gi) maintains two sorts of similarity with the original specification, namely, syntactic and semantic similarity between S and Ri. Figure 1 shows an overview of the different steps of the search process implemented by ACoRe.

ACoRe instantiates multi-objective optimization (MOO) algorithms to efficiently and effectively explore the search space. Currently, ACoRe implements four MOO algorithms, namely, the Non-Dominated Sorting Genetic Algorithm III (NSGA-III) [14], a Weight-Based Genetic Algorithm (WBGA) [29], an Archived Multi-Objective Simulated Annealing (AMOSA) [6] approach, and an unguided search approach we use as a baseline. Let us first describe the components shared by the algorithms (namely, the search space, the multiple objectives, and the evolutionary operators) and then discuss the particular details of each approach (such as the fitness function and selection criteria).

Fig. 1: Overview of ACoRe.

### 4.1 Search Space and Initial Population

Each individual cR = (Dom, G′), representing a candidate resolution, is an LTL specification over a set AP of propositional variables, where Dom captures the domain properties and G′ the refined system goals. Notice that the domain properties Dom are not changed during the search, since these are descriptive statements. ACoRe instead performs syntactic alterations to the original set of goals G to obtain the new set of refined goals G′ that potentially resolves the conflicts given as input.

The initial population represents a sample of the search space from which the search starts. ACoRe creates one or more individuals (depending on the multi-objective algorithm being used) as the initial population by applying the mutation operator (explained below) to the specification S given as input.

### 4.2 Multi-Objectives: Consistency, Resolution and Similarities

ACoRe guides the search with four objectives that check the validity of each of the conditions required for a valid goal-conflict resolution, namely, consistency, resolution, and two similarity metrics (cf. Definition 8).

Given a candidate resolution cR = (Dom, G′), the first objective Consistency(cR) evaluates whether the refined goals G′ are consistent with the domain properties, by using SAT solving.

$$Consistency(cR) = \begin{cases} 1 & \text{if } Dom \land G' \text{ is satisfiable} \\ 0.5 & \text{if } Dom \land G' \text{ is unsatisfiable, but } G' \text{ is satisfiable} \\ 0 & \text{if } G' \text{ is unsatisfiable} \end{cases}$$
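For intuition, the three-valued objective above can be sketched in Python as follows. This is an illustrative rendering, not ACoRe's implementation: the `is_sat` oracle and the toy literal encoding stand in for a real LTL satisfiability checker (ACoRe delegates to Polsat).

```python
# Hypothetical sketch of the Consistency objective. `is_sat` is an
# assumed satisfiability oracle, not part of ACoRe's actual API.
def consistency(dom, goals, is_sat):
    """1 if Dom ∧ G' is satisfiable, 0.5 if only G' is, 0 otherwise."""
    if is_sat(dom + goals):
        return 1.0
    if is_sat(goals):
        return 0.5
    return 0.0

# Toy oracle over lists of literals: a conjunction is satisfiable
# iff it contains no complementary pair such as "p" and "!p".
def toy_sat(literals):
    pos = {l for l in literals if not l.startswith("!")}
    return not any(("!" + l) in literals for l in pos)
```

For example, `consistency(["!p"], ["p"], toy_sat)` evaluates to 0.5: the refined goals alone are satisfiable, but contradict the domain.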

The second objective ResolvedBCs(cR) computes the ratio of boundary conditions resolved by the candidate resolution cR, among the total number of boundary conditions given as input. Hence, ResolvedBCs(cR) returns values between 0 and 1, and is defined as follows:

$$ResolvedBCs(cR) = \frac{\sum\_{i=1}^{k} isResolved(BC\_i, G')}{k}$$

isResolved(BCi, G′) returns 1 if and only if BCi ∧ G′ is satisfiable; otherwise, it returns 0. Intuitively, when BCi ∧ G′ is satisfiable, the refined goals G′ satisfy the resolution condition of Definition 8 and thus BCi is no longer a conflict for candidate resolution cR. In the case that cR resolves all the (k) boundary conditions, the objective ResolvedBCs(cR) returns 1.
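The ratio computation can be sketched as below; again, `is_sat` and the list encoding of formulas are illustrative assumptions standing in for a real LTL solver.

```python
# Hypothetical sketch of the ResolvedBCs objective.
def resolved_bcs(boundary_conditions, refined_goals, is_sat):
    """Ratio of boundary conditions BC_i such that BC_i ∧ G' is satisfiable."""
    resolved = sum(1 for bc in boundary_conditions
                   if is_sat(bc + refined_goals))
    return resolved / len(boundary_conditions)

# Toy oracle: a conjunction of literals is satisfiable iff it contains
# no complementary pair such as "p" and "!p".
def toy_sat(literals):
    pos = {l for l in literals if not l.startswith("!")}
    return not any(("!" + l) in literals for l in pos)
```

For example, with refined goals `["p"]`, the boundary condition `["!p"]` remains a conflict while `["m"]` is resolved, yielding a ratio of 0.5.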

To prioritise resolutions that are in some sense similar to the original specification over dissimilar ones, ACoRe integrates two similarity metrics, one syntactic and one semantic, which help the algorithms focus the search in the vicinity of the specification given as input.

Precisely, objective Syntactic(S, cR) refers to the distance between the text representations of the original specification S and the candidate resolution cR. To compute the syntactic similarity between LTL specifications, we use the Levenshtein distance [42]. Intuitively, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. Hence, Syntactic(S, cR) is computed as:

$$Syntactic(S, cR) = \frac{maxLength - Levenshtein(S, cR)}{maxLength}$$

where maxLength = max(length(S), length(cR)). Intuitively, Syntactic(S, cR) represents the proportion of tokens of the larger specification that remain unchanged when transforming S into cR.
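For illustration, the formula above can be sketched with a straightforward dynamic-programming Levenshtein distance; function names are ours, and specifications are compared as plain strings.

```python
def levenshtein(a, b):
    """Classic edit distance between two sequences (insert/delete/substitute)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def syntactic_similarity(spec, candidate):
    """(maxLength - Levenshtein) / maxLength, mirroring the paper's formula."""
    max_len = max(len(spec), len(candidate))
    if max_len == 0:
        return 1.0
    return (max_len - levenshtein(spec, candidate)) / max_len
```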

On the other hand, our semantic similarity objective Semantic(S, cR) refers to the similarity between the system behaviours described by the original specification and by the candidate resolution. Precisely, Semantic(S, cR) computes the ratio between the number of behaviours present in both the original specification and the candidate resolution, among the total number of behaviours described by either specification. To efficiently compute the objective Semantic(S, cR), ACoRe uses model counting and the approximation previously described in Definition 6. Hence, given a bound k for the lasso traces, the semantic similarity between S and cR is computed as:

$$Semantic(S, cR) = \frac{\#\text{APROM}(S \land cR, k)}{\#\text{APROM}(S \lor cR, k)}$$

Notice that small values for Semantic(S, cR) indicate that the behaviours described by S diverge from those described by cR. In particular, when S and cR are contradictory (i.e., S ∧ cR is unsatisfiable), Semantic(S, cR) is 0. As this value gets closer to 1, both specifications characterize an increasingly large number of common behaviours.

### 4.3 Evolutionary Operators

New individuals are generated through the application of evolutionary operators. Particularly, ACoRe implements two standard operators used for evolving LTL specifications [17,43], namely, a mutation operator and a crossover operator. Below, we provide some examples of the application of these operators; we refer the reader to the complementary material for detailed formal definitions.

Fig. 3: Crossover operator.

Given a candidate individual cR′ = (Dom, G′), the mutation operator selects a goal g′ ∈ G′ to mutate, leading to a new goal g″, and produces a new candidate specification cR″ = (Dom, G″), where G″ = G′[g′ ↦ g″], that is, G″ looks exactly like G′ but goal g′ is replaced by the mutated goal g″.

For instance, Figure 2 shows 5 possible mutations that we can generate for formula ◇(p → □r). Mutation M1 replaces ◇ by □, leading to M1 : □(p → □r). Mutation M2 : ◇(p ∧ □r) replaces → by ∧. Mutation M3 : ◇(p → ¬r) replaces □ by ¬. Mutation M4 : ◇(true → □r), which reduces to ◇□r, replaces p by true, while mutation M5 : ◇(p → □q) replaces r by q.
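A mutation of this kind can be sketched over LTL syntax trees as below. This is a toy illustration in the spirit of examples M1–M5, not ACoRe's actual operator: formulas are nested tuples (e.g. `("F", ("->", "p", ("G", "r")))` for ◇(p → □r)), and the operator choices and probabilities are our assumptions.

```python
import random

def mutate(formula, rng=None):
    """Toy mutation over LTL ASTs: swap F/G, turn -> into &, rename atoms,
    or recurse into a random subformula."""
    rng = rng or random.Random(0)
    if isinstance(formula, str):                    # atomic proposition
        return rng.choice(["p", "q", "r", "true"])  # e.g. p becomes true (M4)
    op, *args = formula
    roll = rng.random()
    if roll < 0.3 and op in ("F", "G"):             # swap eventually/always (M1)
        return ("G" if op == "F" else "F", *args)
    if roll < 0.5 and op == "->":                   # implication becomes ∧ (M2)
        return ("&", *args)
    i = rng.randrange(len(args))                    # otherwise recurse
    new_args = list(args)
    new_args[i] = mutate(new_args[i], rng)
    return (op, *new_args)
```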

In contrast, the crossover operator takes two individuals cR1 = (Dom, G1) and cR2 = (Dom, G2), and produces a new candidate resolution cR″ = (Dom, G″) by combining portions of both specifications. In other words, it takes one goal from each individual, i.e. g1 ∈ G1 and g2 ∈ G2, and generates a new goal g″ obtained by replacing a subformula α of g1 by a subformula β taken from g2. For instance, Figure 3 illustrates how this operator works. Particularly, subformula α : p is selected from goal g1 : ◇(p → □r), while subformula β : ¬p is selected from goal g2 : ¬p ∧ q. Hence, by replacing subformula α by subformula β in g1, the crossover operator generates a new goal g″ : ◇(¬p → □r).
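The subformula-exchange idea can be sketched as follows, reusing the toy nested-tuple encoding; this is an illustrative approximation of the operator, not ACoRe's implementation (for simplicity it replaces all occurrences equal to the chosen subformula).

```python
import random

def subformulas(f):
    """Yield every subformula of a toy LTL AST, including f itself."""
    yield f
    if not isinstance(f, str):
        for arg in f[1:]:
            yield from subformulas(arg)

def crossover(g1, g2, rng):
    """Replace a randomly chosen subformula of g1 with one taken from g2."""
    target = rng.choice(list(subformulas(g1)))
    donor = rng.choice(list(subformulas(g2)))
    def replace(f):
        if f == target:
            return donor
        if isinstance(f, str):
            return f
        return (f[0], *[replace(a) for a in f[1:]])
    return replace(g1)
```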

It is worth mentioning that all four multi-objective search algorithms implemented by ACoRe use the mutation operator to evolve the population. However, only the two genetic algorithms (i.e. NSGA-III and WBGA) also use the crossover operator.

#### 4.4 Multi-Objective Optimisation Search Algorithms

In a multi-objective optimisation (MOO) problem there is a set of solutions, called the Pareto-optimal (PO) set, whose elements are considered equally important. Given two individuals x1 and x2 from the search space S, and a set f1, . . . , fn of (maximising) fitness functions, where fi : S → R, we say that x1 dominates x2 if (a) x1 is not worse than x2 in all objectives and (b) x1 is strictly better than x2 in at least one objective. Typically, MOO algorithms evolve the candidate population with the aim of converging to a set of non-dominated solutions as close to the true PO set as possible, while maintaining as diverse a solution set as possible. Many variants of MOO algorithms have been successfully applied in practice [27]. ACoRe implements four multi-objective optimization algorithms to explore the search space and generate goal-conflict resolutions.
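The dominance relation just defined translates directly into code; the following is a small illustrative sketch with fitness vectors as tuples of maximising objectives.

```python
def dominates(f1, f2):
    """f1 dominates f2 (maximising): no worse in every objective and
    strictly better in at least one."""
    return (all(a >= b for a, b in zip(f1, f2))
            and any(a > b for a, b in zip(f1, f2)))

def non_dominated(points):
    """Keep only the points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]
```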

AMOSA. The Archived Multi-Objective Simulated Annealing (AMOSA) algorithm [6] is an adaptation of simulated annealing [34] to multiple objectives. AMOSA only analyses one (current) individual per iteration, and a new individual is created by the application of the mutation operator. AMOSA has two particular features that make it promising for our purpose. First, during the search it maintains an "archive" with the non-dominated candidates explored so far, that is, candidates whose fitness values are not subsumed by other generated individuals. Second, when a newly created individual does not dominate the current one, it is not immediately discarded: it can still be selected over the current individual with some probability that depends on the "temperature" (a function that decreases over time). At the beginning the temperature is high, so new individuals with worse fitness than the current one are likely to be selected, but this probability decreases over the iterations. This strategy helps in avoiding local maxima and exploring more diverse potential solutions.
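The temperature-dependent acceptance described above follows the classic simulated-annealing scheme, which can be sketched as below. This is a generic Metropolis-style rule for intuition, not AMOSA's exact criterion (which also weighs the amount of domination and the archive).

```python
import math

def accept_probability(curr_fitness, new_fitness, temperature):
    """Always accept improvements; accept worse candidates with a
    probability that shrinks as the temperature cools."""
    if new_fitness >= curr_fitness:
        return 1.0
    return math.exp((new_fitness - curr_fitness) / temperature)
```

At a high temperature a slightly worse candidate is accepted often; as the temperature decreases, the same fitness drop is accepted exponentially less often.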

WBGA. ACoRe also implements a classic Weight-Based Genetic Algorithm (WBGA) [29]. WBGA maintains a fixed number of individuals in each iteration (a configurable parameter) and applies both the mutation and crossover operators to generate new individuals. WBGA computes the fitness value for each objective and combines them into a single fitness f defined as:

$$f(S, cR) = \alpha \cdot Consistency(cR) + \beta \cdot ResolvedBCs(cR) + \gamma \cdot Syntactic(S, cR) + \delta \cdot Semantic(S, cR)$$

where the weights α = 0.1, β = 0.7, γ = 0.1, and δ = 0.1 are defined by default (empirically validated), but can be configured to other values if desired. In each iteration, WBGA sorts all the individuals according to their fitness value (in descending order) and selects the best-ranked individuals to survive to the next iteration (other selectors can be integrated). Finally, WBGA reports all the resolutions found during the search.
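Under the stated default weights, the weighted-sum scalarisation can be sketched as:

```python
def wbga_fitness(consistency, resolved, syntactic, semantic,
                 alpha=0.1, beta=0.7, gamma=0.1, delta=0.1):
    """Weighted-sum scalarisation of the four objectives, with the
    paper's default weights; all inputs are assumed to lie in [0, 1]."""
    return (alpha * consistency + beta * resolved
            + gamma * syntactic + delta * semantic)
```

Note how β = 0.7 makes conflict resolution the dominant term: a candidate resolving all boundary conditions starts at 0.7 before any similarity is counted.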

NSGA-III. ACoRe also implements the Non-Dominated Sorting Genetic Algorithm III (NSGA-III) [14]. It is a variant of a genetic algorithm that also uses the mutation and crossover operators to evolve the population. In each iteration, it computes the fitness values for each individual and sorts the population according to the Pareto dominance relation. It then partitions the population according to the level of the individuals in the Pareto dominance relation (i.e., non-dominated individuals are in Level-1, Level-2 contains the individuals dominated only by individuals in Level-1, and so on). NSGA-III then selects only one individual per non-dominated level, with the aim of diversifying the exploration and reducing the number of resolutions in the final Pareto-front.
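The level-partitioning step can be sketched as follows; this is an illustrative rendering of the non-dominated sorting idea, not NSGA-III's actual (more efficient) implementation.

```python
def pareto_levels(points):
    """Partition fitness vectors into Pareto levels: Level 1 holds the
    non-dominated points, Level 2 those dominated only by Level 1, etc."""
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))
    remaining = list(points)
    levels = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining)]
        levels.append(front)
        remaining = [p for p in remaining if p not in front]
    return levels
```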

ACoRe also implements an Unguided Search algorithm that does not use any of the objectives to guide the search. It randomly selects individuals and applies the mutation operator to evolve the population. After generating a maximum number of individuals (a given parameter of the algorithm), it checks which ones constitute a valid resolution for the goal-conflicts given as input.

### 5 Experimental Evaluation

We start our analysis by investigating the effectiveness of ACoRe in resolving goal-conflicts. Thus, we ask:

RQ1 How effective is ACoRe at resolving goal-conflicts?

To answer this question, we study the ability of ACoRe to generate resolutions in a set of 25 specifications for which we have identified goal-conflicts.

Then, we turn our attention to the "quality" of the resolutions produced by ACoRe and study whether ACoRe is able to replicate some of the manually written resolutions gathered from the literature (ground truth). Thus, we ask:

### RQ2 To what extent can ACoRe generate resolutions that match resolutions provided by engineers (i.e. manually developed ones)?

To answer RQ2, we check if ACoRe can generate resolutions that are equivalent to the ones manually developed by the engineer.

Finally, we are interested in analyzing and comparing the performance of the four search algorithms integrated by ACoRe. Thus, we ask:

### RQ3 What is the performance of ACoRe when adopting different search algorithms?

To answer RQ3, we employ standard quality indicators (e.g. hypervolume (HV) and inverted generational distance (IGD)) to compare the Pareto-fronts produced by ACoRe when the different search algorithms are employed.

#### 5.1 Experimental Procedure

We consider a total of 25 requirements specifications taken from the literature and different benchmarks. These specifications were previously used by goal-conflict identification and assessment approaches [4,16,17,18,43,56].


Table 1: LTL Requirements Specifications and Goal-conflicts Identified.

We start by running the approach of Degiovanni et al. [17] on each subject to identify a set of boundary conditions. Table 1 summarises, for each case, the number of domain properties and goals, and the number of boundary conditions (i.e. goal-conflicts) computed with the approach of Degiovanni et al. [17]. Notice that we use the set of "weakest"<sup>1</sup> boundary conditions returned by [17], in the sense that by removing all of these we are guaranteed to remove all the boundary conditions computed.

Then, we run ACoRe to generate resolutions that remove all the identified goal-conflicts. We configure ACoRe to explore a maximum of 1000 individuals with each algorithm. We repeat this process 10 times to reduce potential threats [5] raised by the random choices made by the search algorithms.

To answer RQ1, we run ACoRe and report the number of non-dominated resolutions produced by each implemented algorithm (i.e. those resolutions whose fitness values are not subsumed by other individuals).

To answer RQ2, we collected from the literature 8 cases in which the authors reported a "buggy" version of a specification together with a "fixed" version of the same specification. We take the buggy version and compute a set of boundary conditions for it, which are then fed into ACoRe to automatically produce a set of resolutions. We then compare the resolutions produced by ACoRe with the "fixed" versions gathered from the literature. Using SAT solving, we analyse whether any of the resolutions produced by ACoRe is equivalent to the manually developed fixed version.

To answer RQ3, we perform an objective comparison of the performance of the four search algorithms implemented by ACoRe by using two standard

<sup>1</sup> A formula A is weaker than B, if B ∧ ¬A is unsatisfiable, i.e., if B implies A.

quality indicators: hypervolume (HV) [62] and inverted generational distance (IGD) [12]. The recent work of Wu et al. [61] indicates that the quality indicators HV and IGD are the preferred ones for assessing genetic algorithms and Pareto evolutionary algorithms such as the ones ACoRe implements (NSGA-III, WBGA, and AMOSA). These quality indicators measure the convergence, spread, uniformity, and cardinality of the solutions computed by the algorithms. More precisely, hypervolume (HV) [42,54] is a volume-based indicator, defined with respect to the Nadir point [38,62], that returns a value between 0 and 1, where a value near 1 indicates that the Pareto-front converges very well to the reference point [42] (high values for HV are also a good indicator of the uniformity and spread of the Pareto-front [54]). The Inverted Generational Distance (IGD) is a distance-based indicator that also measures convergence and spread [42,54]; it computes the mean distance from each reference point to the nearest element in the Pareto-optimal set [12,54]. We also perform statistical analyses, namely, the Kruskal-Wallis H-test [37], the Mann-Whitney U-test [44], and the Vargha-Delaney Â12 measure [59], to compare the performance of the algorithms. Intuitively, the p-value tells us whether the difference in performance between the algorithms, measured in terms of HV and IGD, is statistically significant, while the Â12 measure tells us how frequently one algorithm obtains better indicators than the others.
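For intuition, the Â12 effect size admits a very compact formulation; the sketch below is the standard definition (probability that a random observation from one sample exceeds one from the other, counting ties as half), independent of any particular statistics library.

```python
def vargha_delaney_a12(xs, ys):
    """Â12 effect size: probability that a random draw from xs beats
    one from ys, with ties counted as 0.5."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)
    return wins / (len(xs) * len(ys))
```

A value of 0.5 means neither algorithm tends to obtain better indicator values; values near 1 (or 0) mean one side wins almost always.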

ACoRe is implemented in Java on top of the JMetal framework [50]. It integrates the LTL satisfiability checker Polsat [41], a portfolio tool that runs four LTL solvers in parallel, which helps us efficiently compute the fitness functions. Moreover, ACoRe uses the Owl library [36] to parse and manipulate LTL specifications. The quality indicators are also computed with the JMetal framework, and the statistical tests with Apache Commons Math. We ran all the experiments on a cluster with Xeon E5 2.4GHz nodes, with 5 CPUs and 8GB of RAM available per run.

Regarding the settings of the algorithms, a population size of 100 individuals was defined and the fitness evaluation was limited to 1000 individuals. Moreover, the timeout of the model counting and SAT solvers was set to 300 seconds. The probability of applying the crossover operator was 0.1, while the mutation operator was always applied. A tournament selection of four solutions was used for NSGA-III, while WBGA instantiated Boltzmann selection with an exponential decrement function. WBGA was configured to weight the fitness functions with a proportion of 0.1 for Consistency, 0.7 for ResolvedBCs, 0.1 for Syntactic, and 0.1 for Semantic. AMOSA used a crowding-distance archive, while its cooling scheme relied on an exponential decrement function.

The case studies and results are publicly available at https://sites.google.com/view/acore-goal-conflict-resolution/.

### 6 Experimental Results

### 6.1 RQ1: Effectiveness of ACoRe

Table 2 reports the average number of non-dominated resolutions produced by the algorithms over the 10 runs. First, it is worth mentioning that when ACoRe uses either of the genetic algorithms (NSGA-III or WBGA), it successfully generates at least one resolution for all the case studies. However, AMOSA fails to produce a resolution for lily16 and simple arbiter icse2018 in 2 and 1 of the 10 runs, respectively. Although Unguided Search succeeds in the majority of the cases, it was not able to produce any resolution for the prioritized arbiter, and failed to produce a resolution in 5 out of the 10 runs for simple-arbiter-v2.

Table 2: Effectiveness of ACoRe in producing resolutions.


Table 3: ACoRe effectiveness in producing an exact or more general resolution than the manually written one.


Second, the genetic algorithms (NSGA-III and WBGA) generate on average more (non-dominated) resolutions than AMOSA and Unguided Search. The results show that WBGA generates more (non-dominated) resolutions than the other algorithms in 13 out of the 25 cases, while NSGA-III produces the most (non-dominated) resolutions in 11 cases. Considering the genetic algorithms together, they outperform AMOSA and Unguided Search in 21 out of the 25 cases, and coincide with them in one case (ltl2dba R-2). Finally, Unguided Search generates more resolutions in 3 cases, namely, detector, TCP, and retraction-pattern-1. Interestingly, the different algorithms of ACoRe produce on average between 1 and 8 non-dominated resolutions, which we consider a reasonable number of options for the engineer to manually inspect and validate in order to select the most appropriate one.

ACoRe generates more non-dominated resolutions when adopting genetic algorithms. On average, ACoRe produces between 1 and 8 non-dominated resolutions that can be presented to the engineer for analysis and validation.

# 6.2 RQ2: Comparison with the Ground-truth

Table 3 presents the effectiveness of ACoRe in generating a resolution that is equivalent to or more general than the ones manually developed by engineers. Overall, ACoRe is able to reproduce the same resolutions in 3 out of the 8 cases, namely, for the minepump (our running example), simple arbiter-v2, and detector. As for RQ1, the genetic algorithms outperform AMOSA and Unguided Search in this respect. Notably, Unguided Search can replicate the resolution for the detector case, in which AMOSA fails.

Overall, the genetic algorithms can produce the same or more general resolutions than the ground truth in 3 out of the 8 cases, outperforming AMOSA (1 out of 8) and Unguided Search (2 out of 8).

# 6.3 RQ3: Comparing the Multi-objective Optimization Algorithms

For each set of non-dominated resolutions generated by the different algorithms, we compute the quality indicators HV and IGD over the syntactic and semantic similarity values. The reference point is the best possible value for each objective, namely 1. These indicators allow us to determine which algorithm converges best to the reference point and produces more diverse and optimal resolutions.

Fig. 4: HV of the Pareto-optimal sets generated by ACoRe.

Fig. 5: IGD of the Pareto-optimal sets generated by ACoRe.

Figures 4 and 5 show the boxplots for each quality indicator. NSGA-III obtains on average much better HV and IGD than the rest of the algorithms. Precisely, it obtains on average an HV of 0.66 (the higher the better) and an IGD of 0.34 (the lower the better), outperforming the other algorithms.

To confirm this result, we compare the quality indicators using non-parametric statistical tests: (i) the Kruskal-Wallis test by ranks and (ii) the Mann-Whitney U-test. The α value used for the Kruskal-Wallis test is 0.05, and for the Mann-Whitney U-test 0.0125. Moreover, we complete our assessment with Vargha and Delaney's Â12, a non-parametric effect size measure. Table 4 summarises the results of the pairwise comparison of the approaches. We can observe that in nearly 80% of the cases NSGA-III obtains resolutions with better quality indicators than AMOSA and Unguided Search (and the differences are statistically significant). We can also observe that NSGA-III obtains a higher HV (IGD) than WBGA in 66% (65%) of the cases. From Table 4 we can further observe that WBGA outperforms both AMOSA and Unguided Search, and that AMOSA is the worst-performing algorithm according to the considered quality indicators.


Table 4: HV and IGD quality indicators for the generated resolutions.

Overall, both statistical tests evidence that NSGA-III leads to a set of resolutions with better quality indicators (HV and IGD) than the rest of the algorithms. WBGA comes second, outperforming Unguided Search and AMOSA, while AMOSA shows the lowest performance according to the quality indicators, in several cases even worse than Unguided Search.

### 7 Related Work

Several manual approaches have been proposed to identify inconsistencies between goals and to resolve them once the requirements have been specified. Among them, Murukannaiah et al. [49] compare a genuine analysis of competing hypotheses against modified procedures that incorporate the requirements engineer's thought process. Their empirical evaluation shows that the modified version achieves higher completeness and coverage. Despite the increase in quality, the approach is limited to manual application by engineers, as were previous approaches [56].

Various informal and semi-formal approaches [28,32,33], as well as more formal approaches [21,23,26,30,51,53], have been proposed for detecting logically inconsistent requirements, a strong kind of conflict, as opposed to this work, which focuses on a weak form of conflict called divergence (cf. Section 3).

Moreover, recent approaches have been introduced to automatically identify goal-conflicts. Degiovanni et al. [18] introduced an automated approach in which boundary conditions are computed using a tableaux-based LTL satisfiability checking procedure. Since that approach exhibits serious scalability issues, the later work of Degiovanni et al. [17] proposes a genetic algorithm that mutates the LTL formulas in order to find boundary conditions for the goal specifications. The output of this approach can be fed into ACoRe to produce potential resolutions for the identified conflicts (as shown in our experimental evaluation).

Regarding specification repair approaches, Wang et al. [60] introduced ARepair, an automated tool to repair a faulty model formally specified in Alloy [31]. ARepair takes a faulty Alloy model and a set of failing tests, and applies mutations to the model until all failing tests become passing. In the case of ACoRe, the identified goal-conflicts are what guide the search, and candidates are intended to be syntactically and semantically similar to the original specification.

In the context of reactive synthesis [22,46,52], some approaches have been proposed to repair imperfections in LTL specifications that make them unrealisable (i.e., no implementation that satisfies the specification can be synthesised). The majority of these approaches focus on learning the missing assumptions about the environment [4,10,11,48]. A more recent approach [8], published in a technical report, proposes to mutate both the assumptions and the guarantees (goals) until the specification becomes realisable. Precisely, we use the novel model counting approximation algorithm from Brizzio et al. [8] to compute the semantic similarity between the original buggy specification and the resolutions. However, the notion of repair of Brizzio et al. [8] only requires a realisable specification, which is very general and does not necessarily lead to quality synthesized controllers [20,47]. In this work, the definition of resolution is fine-grained and focused on removing the identified conflicts, which potentially leads to more interesting repairs, as shown in our empirical evaluation.

Alrajeh et al. [2] introduced an automated approach to refine a goal model when the environmental context changes. That is, if the domain properties change, their approach proposes changes to the goals to make them consistent with the new domain. The adapted goal model is generated using a new counterexample-guided learning procedure that ensures the correctness of the updated goal model, preferring more local adaptations and more similar goal models. In our work, the domain properties are not changed and the adaptations are made to resolve the identified inconsistencies; moreover, instead of counterexamples, our search is guided by syntactic and semantic similarity metrics.

## 8 Conclusion

In this paper, we presented ACoRe, the first automated approach for goal-conflict resolution. ACoRe takes a goal specification and a set of previously identified conflicts, expressed in LTL, and computes a set of resolutions that remove these conflicts. ACoRe is a search-based approach: it adopts three multi-objective algorithms (NSGA-III, AMOSA, and WBGA) that simultaneously optimize and deal with the trade-offs among the objectives. We evaluated ACoRe on 25 specifications written in LTL and extracted from the related literature. The evaluation showed that the genetic algorithms (NSGA-III and WBGA) typically generate more (non-dominated) resolutions than AMOSA and the Unguided Search we implemented as a baseline. Moreover, the algorithms generate on average between 1 and 8 resolutions per specification, which may allow the engineer to manually inspect them and select the most appropriate resolution. We also observed that the genetic algorithms (NSGA-III and WBGA) outperform AMOSA and Unguided Search in terms of several quality indicators: the number of (non-dominated) resolutions and standard quality indicators (HV and IGD) for multi-objective algorithms.

Acknowledgements. This work is supported by the Luxembourg National Research Funds (FNR) through the CORE project grant C19/IS/13646587/RASoRS.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# A Modeling Concept for Formal Verification of OS-Based Compositional Software

Leandro Batista Ribeiro<sup>1</sup> , Florian Lorber<sup>2</sup> , Ulrik Nyman<sup>2</sup> , Kim Guldstrand Larsen<sup>2</sup> , and Marcel Baunach<sup>1</sup>

> <sup>1</sup> Graz University of Technology, Graz, Austria {lbatistaribeiro,baunach}@tugraz.at
> <sup>2</sup> Aalborg University, Aalborg, Denmark {florber,ulrik,kgl}@cs.aau.dk

Abstract. The use of formal methods to prove the correctness of compositional embedded systems is increasingly important. However, the required models and algorithms can induce an enormous complexity. Our approach divides the formal system model into layers and these in turn into modules with defined interfaces, so that reduced formal models can be created for the verification of concrete functional and non-functional requirements. In this work, we use Uppaal to (1) model an RTOS kernel in a modular way and formally specify its internal requirements, (2) model abstract tasks that trigger all kernel functionalities in all combinations or scenarios, and (3) verify the resulting system with regard to task synchronization, resource management, and timing. The result is a fully verified model of the operating system layer that can henceforth serve as a dependable foundation for verifying compositional applications w.r.t. various aspects, such as timing or liveness.

Keywords: Embedded Systems · Real-Time Operating Systems · Formal Methods · Uppaal · Software Composition.

# Availability of Artifacts

All Uppaal models and queries are available at https://doi.org/10.6084/m9.figshare.21809403. Throughout the paper, model details are omitted for the sake of readability or due to space constraints. In such cases, the symbol 6 indicates that details can be found in the provided artifacts.

# 1 Introduction

Embedded systems are everywhere, from simple consumer electronics (wearables, home automation, etc.) to complex safety-critical devices, e.g., in the automotive, aerospace, medical, and nuclear domains. While bugs in non-critical devices are at most inconvenient, errors in safety-critical systems can lead to catastrophic consequences, with severe financial or even human losses [19,21]. Therefore, it is of utmost importance to guarantee dependable operation of safety-critical systems at all times. The common industrial practice for validating safety-critical systems is still extensive testing [4]. However, testing can only show the absence of errors in known cases; it cannot prove general system correctness.

While general correctness can be proven with formal methods, they still face resistance from practitioners [24], as they are considered resource-intensive and difficult to integrate into existing development processes [14]. However, potential cost reduction or strict regulations might contribute to their adoption. For example, the use of formal methods can facilitate the acceptance of medical devices by regulatory agencies [13], and is already prescribed as part of future development processes in some domains [30,31].

The software running in embedded devices is commonly composed of applications running on top of an Operating System (OS). Throughout the device life cycle, there are usually many more updates on the application than on the OS. Moreover, the application software is tailored for specific needs, while the OS is a foundation that diverse applications can use. Therefore, it is highly desirable to have a formally verified OS, which does not need to be re-verified when applications are modified. The complete formal verification of software involves the creation of models and their verification. Furthermore, all transition steps from models to machine code must be verified.

In this paper, we focus on the modeling stage by using the model-checking tool Uppaal [23] to model typical features and functionality of modern real-time operating systems and to formally specify requirements to verify the model. Once the OS model is proven correct, it can be used by OS-based software models and reduce the verification effort, since OS requirements do not need to be re-verified.

Our contributions in this paper are (1) an approach that allows the modularization of formal models with defined interfaces, so that these can be assembled as models of the overall system; (2) based on this, guidelines to create a self-contained OS model that facilitates the creation of application models, which can be combined to verify various aspects of the overall software; (3) a concept for creating abstract task models to verify the OS model against the specified requirements.

As a proof of concept and to evaluate our approach in terms of performance and scalability, we formally model typical syscalls that represent the kernel interface towards the higher software levels. We then verify the modeled kernel features under all conceivable situations. For this, we create models that abstract the full software stack, and then verify timing, task synchronization, and resource management with feasible resource expense. The result is a formally verified OS model that can henceforth be used as a foundation for the modeling and verification of complex OS-based applications.

In this paper, we do not address the correctness of concrete OS implementations or the completeness of the specified requirements, i.e., this paper does not aim to prove the correctness of the code-to-model translation, or that all requirements are specified. Still, the provided models and requirements are sufficient to demonstrate the proposed concept.

Table 1. Common task states on RTOSes.

The remainder of this paper is organized as follows: in Section 2 we present relevant concepts for our proposed approach. In Section 3 we describe our approach to model the software layers modularly. In Section 4 we introduce abstract tasks and discuss the verification of OS requirements. In Section 5, we analyze and evaluate the proposed concept. In Section 6 we present related work. Finally, Section 7 summarizes this paper and shows potential future work.

# 2 Background

### 2.1 Real-Time Operating System (RTOS)

Complex OSes quickly lead to state explosion when model-checking. Therefore, we focus on a small common set of features of modern RTOSes that enables real-time behavior, namely preemptive multitasking, priority-driven scheduling, task synchronization, resource management, and time management. Priority inheritance protocols are not addressed in this paper, because they are not necessary to demonstrate our proposed concept. However, they can be integrated by modifying the related syscalls.

Tasks are the basic execution unit of RTOS-based software. They run in user mode and have fewer privileges than the kernel, which runs in kernel mode. Tasks have individual priorities, execute concurrently, and interact with the OS via syscalls. A task can be in one of the four states shown in Table 1. Specific implementations might not contain all states. For example, in this paper we model tasks as infinite loops, which never terminate; thus, they have no suspended state. RTOSes commonly contain an idle task, which runs when no other task is in the ready state.
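The four task states and the transitions a priority-driven RTOS typically allows between them can be sketched in C. The state names match those used throughout the paper; the transition rules and identifiers are illustrative assumptions about a generic RTOS, not part of our model:

```c
#include <stdbool.h>

/* Sketch of the four common task states (cf. Table 1). The transition
 * rules below are typical for priority-driven RTOSes and illustrative
 * only; a concrete OS may differ. */
typedef enum { READY, RUNNING, WAITING, SUSPENDED } task_state_t;

bool transition_allowed(task_state_t from, task_state_t to) {
    switch (from) {
    case READY:     return to == RUNNING;    /* dispatched by the scheduler */
    case RUNNING:   return to == READY       /* preempted */
                        || to == WAITING     /* blocked on resource/event/sleep */
                        || to == SUSPENDED;  /* terminated */
    case WAITING:   return to == READY;      /* resource/event granted or timeout */
    case SUSPENDED: return to == READY;      /* (re)activated */
    }
    return false;
}
```

A task modeled as an infinite loop, as in this paper, simply never enters the SUSPENDED state.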

The Kernel is responsible for providing services to tasks and for interacting with the hardware. It initializes the system on startup and switches between tasks at runtime. Kernel execution can be triggered by tasks or interrupts through a fixed interface only.

Syscalls and Interrupt Service Routines (ISRs) are special functions that are exclusively provided by the kernel and define its interface. While user-mode software can only interact with the OS through syscalls, ISRs can only be triggered by the hardware. The modeled syscalls and ISR are covered in Section 3.

Time Management is an important feature of RTOSes. The kernel (1) maintains an internal timeline to which all tasks can relate, and (2) allows tasks to specify timing requirements.

Fig. 1. A general Uppaal timed automaton template.

Events can be used for inter-task communication and to react on interrupts. They provide a unified synchronization mechanism across hardware and software, in which tasks can signal each other, and interrupts can trigger tasks.

Resources coordinate the access of tasks to exclusively shared components, like hardware (e.g., I/O peripherals) or virtual entities (e.g., data structures). They can be requested from the OS and are assigned depending on availability and the priority of waiting tasks.

The Scheduler is responsible for coordinating the interleaving of tasks according to one or more predefined policies, such as fixed-priority, Rate-Monotonic Scheduling (RMS), and Earliest Deadline First (EDF).

### 2.2 Uppaal

For modeling and verification, we choose the model-checking tool Uppaal [23], in which systems are formalized as networks of timed automata with additional functions and data structures that are executed and changed on edges. Since we model preemptive tasks, we use Uppaal 4.1, which supports stopwatch automata [10] and enables the elegant modeling of preemption. While a formal definition of timed automata is provided in [7], we briefly describe the features relevant for this work. Examples in this section refer to Fig. 1.

Timed automata are composed of (labeled) locations and edges. In Uppaal, timed automata are specified with the concept of templates, which are similar to classes in object-oriented programming. For the verification, the templates are instantiated into processes (analogous to objects). All instantiated processes execute concurrently in a Uppaal model. However, they can still be modeled in a fashion that executes them sequentially, which we adopted in our models.

Locations. Standard locations are represented by a circle (L2\_NAME). The initial location (L1\_NAME) is represented by a double circle. Committed locations (L3\_NAME) have a letter "C" within the circle; they are used to connect multi-step atomic operations. In contrast to standard locations, time does not pass while any automaton is in a committed location. Locations can have names and invariants. Location names can be used in requirement specifications and ease the readability of automata. A location invariant (e.g., \_clk<100) is an expression that must hold while the automaton is in the corresponding location.

Edges connect locations in a directional manner. Edge transitions are instantaneous, i.e., they introduce zero time overhead. Edges can have a select statement (selectVar : Range), a guard (\_guard()), a synchronization (\_synch!), and an update operation (\_update()). A select statement non-deterministically chooses a value from a range of options and assigns it to a variable. A guard controls whether or not its edge is enabled. An update operation is a sequence of expressions to be executed. Finally, processes can synchronize and communicate via channels.

```
typedef int[5, 10] from5to10_t;

const from5to10_t VALID = 10;
from5to10_t invalid = 4; // verification failure

typedef struct { from5to10_t var1; } newStruct_t;
```

Listing 1.1. Bounded types and data structures in Uppaal.

Communication channels (\_synch) allow processes to send output (\_synch!) or listen for input (\_synch?). Uppaal supports handshake and broadcast communication. When a synchronizing transition is triggered, both the sender and the listener(s) move to the next location simultaneously, assuming their guards allow for the transition to be taken. The update operation happens first at the sender side, allowing the sender to communicate numeric values via shared variables. In our approach, this is used to pass function/syscall parameters and return values between model modules.

Time is modeled with clock variables (\_clk). The timing behavior is controlled with clock constraint expressions in invariants and guards. For example, the invariant \_clk < 100 and the guard \_clk >= 50 indicate that the transition from L2\_NAME to L3\_NAME happens when \_clk is in the interval [50, 100). In general, all clock variables progress continuously and synchronously. However, the stopwatch feature of Uppaal 4.1 provides a way to stop one or more clocks in any location, namely by setting the clock derivative to zero (\_clk' == 0). When the derivative is not written in the location invariant, its default value (1) is used and the clock progresses normally. For our system models, stopwatches are used to measure and verify the execution time of preemptive tasks. A task's clock progresses only if the task is in the running state, otherwise it is stopped.
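The stopwatch semantics can be mimicked in ordinary code: a task's execution-time counter advances with global time only while the task is running. The following C sketch (with invented names) illustrates the idea behind setting the clock derivative to zero (\_clk' == 0):

```c
/* Minimal sketch of a stopwatch clock: 'running' plays the role of the
 * clock derivative (1 = clock progresses, 0 = clock stopped). */
typedef struct {
    int et;      /* accumulated execution time */
    int running; /* 1 while the task is in the running state */
} task_clock_t;

/* Advance global time by dt; the task clock only progresses if running. */
void tick(task_clock_t *c, int dt) {
    if (c->running) c->et += dt;
}
```

When the task is preempted, `running` is cleared and subsequent ticks leave `et` unchanged, just as the execution-time clock of a preempted task is frozen in the model.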

Functions, data structures and bounded data types are defined in Uppaal in a C-like language. Bounded types are very convenient for detecting unwanted values during the verification, which is immediately aborted in case a variable is assigned a value outside its type range. The syntax is exemplified in Listing 1.1.
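C has no range-restricted integer types, but the effect of Uppaal's bounded types can be approximated with an assertion-guarded setter. This sketch is our own illustration, not part of the model; it aborts on out-of-range values just as the verification does:

```c
#include <assert.h>

typedef int from5to10_t; /* plain int; the bound is enforced manually */

/* Mirrors Uppaal aborting verification on a value outside [5, 10]. */
from5to10_t checked_from5to10(int v) {
    assert(5 <= v && v <= 10);
    return v;
}
```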

Formal verification. Uppaal performs symbolic model-checking to exhaustively verify the specified system requirements. The Uppaal specification language allows expressing liveness, safety, and reachability properties.

An important operator offered by Uppaal is "-->" (leads to): p --> q means that whenever p holds, q shall also eventually hold. This notation is particularly useful to detect task starvation: if a task in the ready state does not lead to its running state, it starves. The keyword deadlock in the Uppaal verification query language is used to detect system states that are not able to progress, i.e., states of the model in which no edges are enabled. Throughout this paper, such situations are referred to as Uppaal deadlocks. They must not be confused with (task) deadlocks, which refer only to tasks waiting cyclically on resources.

Fig. 4. Kernel interface template. Syscalls highlighted 6.

### 3 Model Design


In this section, we propose a general modular approach to model OSes and (abstractions of) application tasks. Our overall goal is to formally prove that a system meets all (non-)functional requirements, which we divide into OS-internal and overall software composition requirements. The characteristics of each category are described in Section 4.

We logically divide the Uppaal model into three layers, as shown in Fig. 2. The application<sup>3</sup> contains tasks that run in user mode and can use OS services through syscalls. The kernel interface is responsible for switching between user and kernel mode, and for invoking the appropriate OS services or functionality upon syscalls or interrupts.

In this paper, we primarily focus on the operating system layer and how to model it with the goal to simplify the later modeling of the application layer. The result is a strict layering of the overall software model, where modules above the OS layer can be added, removed or updated without re-verifying the OS itself.

To demonstrate the applicability of our approach, we create an OS model 6 (composed of sub-models) based on common features of modern RTOSes: preemptive multitasking, priority-driven scheduling, and syscalls for task synchronization, resource management, and time management. The modeling techniques are generic, and any concrete OS can be similarly modeled.

#### 3.1 Naming Convention

For readability, there is a naming convention for communication channels and variables throughout the entire model: Channels starting with an underscore

<sup>3</sup> For this paper, user libraries and middleware services are abstracted into the application layer and are not discussed separately.

(e.g., \_proceed! in Fig. 4) represent internal kernel communication or are used for interrupt handling. Similarly, variables starting with an underscore represent internal kernel data structures. As in real code, the application layer must not directly access such OS-internal functions or variables. Channels and variables that can be accessed by the application layer as part of the OS interface start with a letter (e.g., sleep? in Fig. 4). Unfortunately, Uppaal does not support such scope separation, so the naming convention serves only as a visual aid.

### 3.2 The Kernel Interface

The kernel interface must offer all possibilities to switch from user to kernel mode, modeled with communication channels. Triggering such channels from automata in the application layer represents a syscall in the real code.

Fig. 4 depicts our modeled kernel interface. A context switch (\_kernelEntry!) occurs either upon a syscall, if the parameters are valid (valid 6), or upon a timer interrupt (\_timerInt). More interrupts (or syscalls) can be supported by adding their corresponding automata and respective edges to the kernel interface.

Kernel Execution and Kernel Overhead. Our modeling approach can precisely reflect the runtime overhead introduced in a preemptive system by the OS kernel itself. This allows a more accurate verification of the behavior of embedded systems compared to approaches that abstract away the OS layer. While different types of OS overhead can be modeled, we initially focus on timing.

Therefore, the kernel interface in Fig. 4 triggers a separate automaton for the kernel timing (execute[KERNEL]!), as shown in Fig. 3. The execution time interval [bcet, wcet] contains the time required to enter the kernel, process the invoked syscall or ISR, execute further kernel functions (e.g., the scheduler), and exit the kernel. This concentrated timing computation is possible because the kernel executes atomically (in contrast to the preemptive tasks).

Next, after taking kernel timing into consideration (execDone[KERNEL]?), we trigger the automata for the functional part of the actual syscall or ISR. The variable sid in \_syscall[sid]! is updated along the syscall edges 6 and identifies the ID of the invoked syscall. The same approach can be used for modeling multiple interrupts.

#### 3.3 The Operating System

The OS model must contain the internal data structures as well as the Uppaal templates for the scheduler and for all syscalls. For this paper, we created the OS model based on the SmartOS [28] implementation.

Data Structures and Tight Bounds. We must declare all OS variables and arrays with data types of the tightest possible boundaries, according to the system parameters. Listing 1.2 shows a few examples from our OS model.

A beneficial consequence is a strict verification that does not tolerate any value out of range. In such cases, the verification immediately fails and aborts.

```
// 1 - System Parameters
const int NTASKS, NEVENTS, NRESOURCES, MMGR;

// 2 - Type Definitions
typedef struct {
    int[0, NTASKS] qCtr;        // the number of tasks in the ready queue
    ExtTaskId_t readyQ[NTASKS]; // the ready queue containing all tasks
                                // in ready state, sorted by priority
} SCB_t;                        // Scheduler Control Block

typedef int[0, NTASKS - 1] TaskId_t;

// 3 - Declaration of Control Blocks
TCB_t _TCB[NTASKS];     // Task CBs
RCB_t _Res[NRESOURCES]; // Resource CBs
SCB_t _sched;           // Scheduler CB
```
In other words, if the verification finishes, there is a guarantee that no boundary violation has occurred.

The Scheduler must be the only part of the OS model allowed to manipulate the ready queue (see Listing 1.2) and dispatch Ready tasks for execution.

Before the first task is dispatched, the system must be fully initialized. To ensure this, we use a single initial committed location, from which an initializing edge transition occurs. Fig. 5 shows this behavior for the scheduler. The function startOS() initializes all internal data structures of the OS. Next, because the following location is also committed, the scheduler immediately dispatches the highest-priority Ready task and switches to user mode (uppermost edge). The scheduler must then wait for instructions (\_proceed?, \_schedule?, etc.), which are issued by syscalls or ISRs, and adapt the ready queue accordingly 6.
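The ready-queue handling described above can be sketched in C: insertion keeps the queue sorted by priority, and dispatch returns the head (falling back to the idle task). All identifiers, the configuration size, and the static priority table are illustrative assumptions, not taken from our model:

```c
#define NTASKS 4 /* assumed configuration: idle task + 3 others */

typedef struct {
    int qCtr;           /* number of tasks in the ready queue */
    int readyQ[NTASKS]; /* task ids, sorted by descending priority */
} scb_t;

/* Illustrative static priorities; higher value = higher priority.
 * Task 0 is the idle task with the lowest priority. */
static const int prio[NTASKS] = { 0, 3, 1, 2 };

/* Insert a task so that the queue stays sorted by priority. */
void sched_insert(scb_t *s, int tid) {
    int i = s->qCtr;
    while (i > 0 && prio[s->readyQ[i - 1]] < prio[tid]) {
        s->readyQ[i] = s->readyQ[i - 1];
        i--;
    }
    s->readyQ[i] = tid;
    s->qCtr++;
}

/* Dispatch the highest-priority ready task; fall back to the idle task. */
int sched_dispatch(const scb_t *s) {
    return s->qCtr > 0 ? s->readyQ[0] : 0;
}
```

Keeping the queue sorted on insertion makes dispatch a constant-time read of the head, which matches the intent of a priority-sorted readyQ in Listing 1.2.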

Syscalls. Each syscall must have a dedicated Uppaal template, which models its semantics, i.e., the manipulation of related OS data structures, and interactions with the scheduler. Syscalls can be triggered (1) from the kernel interface (\_syscall[sid]!) or (2) from other syscalls. Their general structure is an initial non-committed location, followed by a sequence of transitions through committed locations, making the syscall execution atomic, as shown in Fig. 6.

Task slices. While syscall automata model the behavior of the OS, task slices model different aspects of task execution, as shown in Fig. 7. They can directly communicate with task models (e.g., in Fig. 7(c), start/end a real-time block), or progress upon kernel operations (e.g., in Fig. 7(d), state change upon scheduler actions). The latter is completely transparent to task models. The use of task slices facilitates the modeling of tasks (Section 3.4) and the formal specification and verification of requirements (Section 4).

Fig. 5. The priority-driven scheduler 6.

Fig. 6. The releaseResource syscall model 6.

Fig. 7. Modeled task slices 6: (a) Task Execution, (b) Task Timeout, (c) Task Real-Time, (d) Task States.

Task Execution Time. This task slice represents the user-space execution time of (code blocks within) a task. It abstracts away the code functionality, but allows the modeling of a [bcet, wcet] range. While the specification of the range itself is shown in Section 3.4, the helper template is shown in Fig. 7(a). Its structure is similar to the kernel execution time template in Fig. 3. However, we cannot assure that the execution of code in user mode is atomic, and must therefore consider preemption: If a \_kernelEntry! occurs while a task is in the Executing location, it goes to Preempted, where the task execution is paused, i.e., the execution time clock et is paused (et'==0).

Task Timeout. This task slice is responsible for handling timeouts of syscalls (e.g., sleep), and thus it must trigger timer interrupts. Our version is depicted in Fig. 7(b)<sup>4</sup> . The clock c is used to keep track of elapsed time. The location Waiting can be left in two different ways: either the timeout expires (edge with c==timeout), or the task receives the requested resource/event (edge with \_schedule?) before the timeout. If c==timeout, a timer interrupt is generated (\_timerInt!) if the system is not in kernel mode. Otherwise, we directly proceed to the next location, where we wait for a signal from the scheduler (\_wakeNext?) indicating that the task can be scheduled again. Finally, we instruct the scheduler to insert the current task into the ready queue with \_schedule!.

<sup>4</sup> In our model, all syscalls with a timeout internally use \_sleep[id] 6. Other approaches might require multiple outgoing edges from the initial state.

Task Real-Time. This task slice is used to verify real-time behavior, as it can detect deadline violations. This task slice acts as an observer of the response times during verification, and has no influence on OS data structures or locations.

As shown in Fig. 7(c), there is a local clock rt, which is used to compute the response time of a code sequence. It remains paused unless startRealTime[id]? is triggered by the corresponding task. This happens in the task model (as shown in Section 3.4) and indicates that the task is about to start the execution of a code sequence with timing constraints. rt then progresses until the task triggers endRealTime[id]?. If this happens before the deadline is reached, the process returns to its initial state and is ready for another real-time block. Otherwise, the system goes to the DLViolation error state. The self-loop in the error state is used to avoid a Uppaal deadlock<sup>5</sup> .
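The observer role of this slice can be paraphrased in code: record the time when a real-time block starts, compare the elapsed time against the deadline when it ends, and latch a violation flag, analogous to entering the DLViolation location. The names and the integer time base are our own assumptions:

```c
#include <stdbool.h>

typedef struct {
    int start;     /* time recorded at the start of a real-time block */
    int deadline;  /* maximum allowed response time */
    bool violated; /* latched, like the DLViolation error location */
} rt_observer_t;

/* Counterpart of startRealTime: begin measuring the response time. */
void rt_start(rt_observer_t *o, int now) { o->start = now; }

/* Counterpart of endRealTime: latch a violation if the deadline passed. */
void rt_end(rt_observer_t *o, int now) {
    if (now - o->start > o->deadline) o->violated = true;
}
```

Like the task slice, this observer only reads time and never influences the observed system.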

Task States. This task slice allows the detection of task starvation. A task starves if it never runs (again). A special case of starvation is task deadlock, which can be detected by additionally analyzing the OS internal data structures and identifying cyclic waiting on resources. Fig. 7(d) shows the modeled task states (as locations) and the actions that trigger state transitions.

The use of task slices is an extensible modeling concept: Extra task slices can be added to enable the verification of other (non-)functional requirements, e.g., energy/memory consumption.

#### 3.4 Simple Application Modeling

The OS model, kernel interface, and task slices are designed with a common goal: Simplify the modeling of application tasks and make the overall system verification more efficient. With our concept, task models just need to use the provided interfaces (channels) and pass the desired parameters.

In summary, a task can be modeled with three simple patterns, as exemplified in Fig. 8:

➊ syscalls: invocation by triggering the corresponding channel, then waiting for dispatch[id]? (from the scheduler),

➋ execution of regular user code between execute[id]! and execDone[id]? (from Task Execution Time task slice),

➌ specification of real-time blocks between startRealTime! and endRealTime!.

As an example, Fig. 8 models the task source code from Listing 1.3 as a Uppaal task. The variables p1 and p2 are used to pass data between different processes, e.g., for syscall parameters.

For ➊ and ➋, the use of the guard amIRunning(id) is crucial for the correct behavior of the task. It allows a task to proceed only if it is Running. The absence of this guard would allow any task to execute, regardless of priorities or task states.

For ➌, however, this guard is not necessary when starting or ending real-time blocks: if a task reaches the beginning of a real-time block, the response time computation must start immediately, even if the task is preempted. Similarly, after the execution of a real-time block, the response time computation must stop immediately.

<sup>5</sup> In our approach, an Uppaal deadlock indicates a modeling mistake.

Fig. 8. Uppaal model of the code from Listing 1.3.

```
OS_TASKENTRY(taskSort) {
  while (1) {
➊   waitEvent(evSort);
➌   // START: Real-Time Task block. Deadline=400
➌➋  quickSort(buffer, BUFSIZE); // Execution Block: BCET=20, WCET=50
➌   // END: Real-Time Task block
➋   for (...) printf("\n%u", buffer[i]); // Execution Block: BCET=WCET=20
➊   setEvent(evSorted);
  }
}
```

Listing 1.3. Source code of a task.

# 4 Requirements and Verification

### 4.1 Composition Requirements

These requirements refer to task properties that are influenced by other tasks running in the system, such as freedom from starvation and from deadline violations 6.

If a composition requirement is violated, the underlying cause is usually a badly composed or implemented task set, which makes it impossible for all tasks to coexist. However, it is also possible that an error in the OS leads to a violation of the composition requirements. In order to exclude this second possibility when verifying the complete system model, we must formally verify the OS model first.

### 4.2 OS Requirements

The OS requirements refer to OS properties that must always hold (invariants), regardless of the number of tasks in the system or of how these tasks interact with the OS (or with each other through the OS). As described in Section 3.3, the OS model is composed of data structures and multiple Uppaal templates, which must be consistent at all times (general requirement). For example, if a task is in the Waiting location of the task timeout task slice, it must also be in the Waiting location of the task states task slice. In Uppaal, we can verify this requirement with the query:

A[] **forall** (Tasks) TaskTimeout.Waiting imply TaskStates.Waiting 6

This example shows an important point when extending our concept: Whenever new task slices are added to verify other (non-)functional requirements of the application, additional OS requirements must be specified to verify the consistency of the new task slice with pre-existing parts of the OS model.

#### 4.3 Verifying the Requirements

For a given software (i.e., OS and application), we can prove correctness w.r.t. the OS and composition requirements by verifying all associated queries. However, we cannot yet claim that the OS model is correct in general (i.e., independent from the task composition), because we do not know if all possible OS operations were considered in all possible scenarios during the verification. Therefore, a complete re-verification of both layers is required in case the application changes.

To avoid the repeated and resource-expensive re-verification of the OS requirements for each task set, we must prove that the OS model is correct in general. We can then limit the re-verification to the application layer. To achieve this goal, we need to make sure that all possible OS operations are verified in all possible scenarios and execution orders. One possible strategy is to create different task sets to reach different scenarios, similar to test case generation. However, this strategy requires the prior identification of relevant scenarios, and the creation of the corresponding task sets. Additionally, it is hard to guarantee that all scenarios were indeed identified. Therefore, we introduce a new concept that inherently covers all scenarios: abstract tasks. They unite all possible behaviors of concrete tasks, i.e., they can trigger any action at any time. A task set with N abstract tasks thus represents the behavior of all possible task sets with N (concrete) tasks. Thus, by definition, all possible scenarios will be reached (Uppaal exhaustive approach).

Abstract Tasks. Real tasks, as exemplified in Listing 1.3, are strictly sequential. Thus, a (concrete) task model is a predefined sequence of steps, as discussed in Section 3.4, and shown in Fig. 8. Their key characteristic is that only one outgoing edge is enabled in any location at any point in time.

The abstract task is depicted in Fig. 9. Unlike a concrete task, it has multiple outgoing edges enabled, which open all possible options to progress: ➊ syscalls with valid parameters and ➋ user code execution (execute[id]!). Thus, the behavior of any concrete task can also be achieved with the abstract task.

While different actions are performed by taking different edges, the parameters are non-deterministically chosen in the select statements for each syscall. The Uppaal state space exploration mechanisms guarantee that all values of the select statements are considered for each edge.

Select statements are not necessary for the timing parameters EX\_TIME and SL\_TIME. Fixed values have less impact on the state space, and are enough to fire all edges from the task execution and task timeout (Fig. 7(a) and Fig. 7(b), respectively). We define the timing parameters 6 in a way that all edges are eventually fired and the state space remains small enough for a feasible verification.

Fig. 9. The abstract task model 6.

Non-Goals of Verification with Abstract Tasks. With abstract tasks, it is meaningless to verify whether composition requirements are satisfied at task level: abstract tasks – by definition – lead to states where composition requirements are violated<sup>6</sup>. The goal of abstract tasks is to ensure that the OS itself works correctly even if the task composition is flawed, e.g., if it leads to starvation or livelocks. This is achieved by verifying the OS requirements in all conceivable scenarios (at the end of Section 4.4, we show how to verify that flawed composition scenarios are also reached). Additionally, we do not explore invalid values of variables/parameters: out-of-bound values lead to verification failure, and when invalid syscall parameters are detected in the kernel interface, no functionality is triggered in the OS. Thus, checking for invalid values would increase the state space without adding new behaviors.

### 4.4 OS Model Verification

A single set of abstract tasks provides a reliable way of verifying scenarios that could otherwise only be reached with numerous concrete task sets. To fully verify the OS model, we must compose the abstract task set so that it triggers all OS operations in all possible scenarios (covering all corner cases).

Within our model, we can control four system parameters that affect the OS verification: NTASKS, NEVENTS, NRESOURCES, and MMGR<sup>7</sup> , cf. Listing 1.2. We use a short notation to represent the system configuration. For example, 5-3-4-2 represents a configuration with NTASKS = 5 (idle task + 4 others), NEVENTS = 3, NRESOURCES = 4, and MMGR = 2. The goal is to find the minimal configuration that reaches all possible scenarios, and thus allows the complete verification of the OS model with minimal verification effort.
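As an aside, the shorthand can be parsed mechanically; the following Python sketch (our own helper, not part of the Uppaal model) maps the notation to the four system parameters named in Listing 1.2:

```python
# Hypothetical helper for the configuration shorthand, e.g. "5-3-4-2".
# The parameter names follow the paper; the parser itself is ours.
from typing import NamedTuple

class SystemConfig(NamedTuple):
    ntasks: int      # NTASKS: number of tasks, including the idle task
    nevents: int     # NEVENTS: number of events
    nresources: int  # NRESOURCES: number of resources
    mmgr: int        # MMGR: upper limit of the resource counters

def parse_config(notation: str) -> SystemConfig:
    """Parse the short notation into the four named parameters."""
    parts = [int(p) for p in notation.split("-")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 parameters, got {len(parts)}")
    return SystemConfig(*parts)

cfg = parse_config("5-3-4-2")
print(cfg.ntasks, cfg.nevents, cfg.nresources, cfg.mmgr)  # 5 3 4 2
```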

<sup>6</sup> Unless the OS offers guarantees by design, e.g., if it implements the Highest Locker Protocol (HLP), task deadlock scenarios must not be reachable.

<sup>7</sup> Maximum multiple getResource, i.e., the upper limit of the resource counter.

Model Coverage. In order to cover the whole model, the verification must traverse all edges, and entirely cover the C-like code of update operations.

Edge Coverage. If there is at least one edge in the model that is not traversed during verification, the model is surely not fully verified; unreachable edges could also indicate design flaws in the model. Therefore, the first step of the verification addresses edge coverage. We add boolean markers to strategic edges, which are set to true when the corresponding edge is taken. We then verify whether all markers eventually become true:

### E<> **forall** (i : int [0, NEDGES-1]) edge[i]==**true**

Edge Scenarios. A single edge can be traversed in multiple scenarios, due to composite guards (with the pattern (A or B or C ...)) or update operations (parameter passing or functions). For the composite guards, we must verify that each of their components is reachable, with queries of the following pattern:

### E<> Location and A

For the update operations, select statements ensure that an edge is traversed with all valid parameter values. The functions demand a more careful analysis: it is necessary to identify all corner cases and verify their reachability. For example, to verify the corner cases of a list insertion, we can use the following queries:

### E<> InsertLocation and firstPosInsertion
### E<> InsertLocation and lastPosInsertion
### E<> InsertLocation and intermediatePosInsertion

After an iterative process of increasing the configuration and verifying the aforementioned properties, we found the smallest configuration that entirely covers our OS model: 4-1-1-2.

OS and Composition Requirements. The goal of the verification of the OS model is to guarantee that all OS requirements are met. In conjunction with the full model coverage verification, we prove that they are met regardless of the operations performed by individual tasks on top of the OS.

However, to ensure that the OS model is correct, we still must prove that the OS requirements are also met in states where composition requirements are violated. For that, we must identify all situations that violate composition requirements, and verify their reachability. For example, the reachability of a deadlock scenario can be verified with the query:

### E<> Res1.owner == Task1 and Res2.owner == Task2 and Task1.waits == Res2 and Task2.waits == Res1

The deadlock scenario reveals that 4-1-1-2 is not sufficient to reach all composition scenarios, since at least two resources are required to cause it. For the modeled OS features, all composition scenarios are reachable with 4-1-2-2.



# 5 Analysis and Evaluation

So far, we verified 4-1-2-2<sup>8</sup> and confirmed that it satisfies all specified OS requirements and the necessary aspects discussed in Section 4: (1) all model edges are traversed at least once; (2) syscalls are invoked with all possible parameters; (3) all corner cases of edge update operations are reached; (4) all components of composite guards are satisfied; (5) valid and invalid composition scenarios are reached. In this section, we analyze how the minimal configuration is obtained in the general case, as well as the scalability of the approach. We then reason why bigger configurations are not necessary for the verification.

### 5.1 Compositional Approach to Deriving the Minimal Configuration

The verification of the OS model is essentially the verification of its set of supported features. Thus, the composition of all minimal configurations needed to verify individual features is used to verify properties of the entire OS.

We assume that feature developers/experts provide the minimal configuration based on the corner cases and composition scenarios of their feature. We then build the minimal configuration by using the highest value of each parameter of each analyzed feature, as described in Algorithm 1. For example, the dominating features<sup>9</sup> in our OS model are resource management (3-0-2-2) and event passing (4-1-0-0), which lead to the resulting configuration 4-1-2-2.
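The composition step described above (the parameter-wise maximum taken by Algorithm 1, which is not reproduced in this excerpt) can be sketched as:

```python
# Sketch of the composition step behind Algorithm 1 (function name ours):
# the minimal configuration of the whole OS is the parameter-wise maximum
# over the per-feature minimal configurations.
def compose_min_config(feature_configs):
    """feature_configs: iterable of (NTASKS, NEVENTS, NRESOURCES, MMGR)."""
    return tuple(max(values) for values in zip(*feature_configs))

# Dominating features from the paper: resource management and event passing.
features = {
    "resource management": (3, 0, 2, 2),
    "event passing": (4, 1, 0, 0),
}
print(compose_min_config(features.values()))  # (4, 1, 2, 2)
```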

### 5.2 Scalability: Resource Consumption for Verification

First, we present a concrete analysis of our approach, namely the number of explored states, CPU time, and memory consumption during verification. Additionally, we show how each system parameter influences these values.

The verification was performed with UPPAAL 4.1.26 x64, running on a machine with Ubuntu 18.04.5 LTS, a 16 core Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 64GB DDR4 memory @ 1600MHz, and 8GB swap.

State Space. In order to explore all states with a low processing overhead, we verify the query "A[] **true**". Fig. 10 and Table 2 show the number of explored states with different system configurations. The leftmost point (Delta = 0) in

<sup>8</sup> see Section 4.4 for configuration notation.

<sup>9</sup> No other feature has higher parameter values.


Table 2. Verification time (minutes) and memory consumption (MB).

Fig. 10. Verification overhead for different configurations.

Fig. 10 represents our proposed minimal system configuration 4-1-2-2. We then vary one of the parameters, while all others are constant. For example, the "Varying Events" line on Delta = 1 shows the number of states for 4-2-2-2; and the "Varying Res. Ctr." line on Delta = 2 the number of states for 4-1-2-4.

The curves in Fig. 10 show that NTASKS has the biggest impact on the state space, and that MMGR has the smallest. While MMGR affects only the upper bound of the resource counters, NTASKS affects all kernel data structures, since each task can call any of the syscalls, which drive the modifications of the kernel data structures. In fact, the verification of 6-1-2-2 did not finish: it required more than 72GB of RAM, and the process was killed by Linux. Shortly before being killed, it had already explored 950 million states.

It is important to highlight that the scalability is much better when simple concrete tasks are modeled. To demonstrate this, we modeled a concrete task set with sequential execution (without preemption) and used the configuration C-(51-50-2-2), where C- indicates a configuration for a concrete task set. Table 2 shows that verifying "A[] **true**" explored only 574,266 states. Additionally, ongoing research on reducing the state space, such as partial-order reduction [22], will enable the verification of ever larger systems.

Memory consumption and CPU time. For the tested configurations, memory and CPU time follow a pattern similar to the number of explored states (Fig. 10). However, the number of states is not the only factor influencing resource consumption. The verification of C-(51-50-2-2) took longer and used more memory than the verification of 4-1-4-2, even though its state space is almost 10 times smaller (see Table 2). The size of individual states also plays an important role, because states are written to and read from memory during the verification. In our OS model, NTASKS, NEVENTS, and NRESOURCES contribute to the state size, since bigger values increase the size and number of data structures.

### 5.3 Sufficiency of 4-1-2-2 Configuration for our OS Model

We cannot run the verification of the OS model with arbitrarily big system configurations, due to the state space explosion problem. Therefore, we reason that, despite creating a larger state space, bigger configurations do not create any new scenarios in the OS layer.

As discussed in Section 3.3, the bounds of all data types are as tight as possible, and are defined according to the system parameters. Thus, when a parameter is increased, the bounds of the variables are adapted accordingly, avoiding out-of-bounds errors.

Since the bounds of data types and arrays are already covered by design, we only need to ensure that no extra corner cases arise in queue operations.

More abstract tasks. With more tasks, the capacity of OS internal queues increases. Thus, there are more positions in which a new element can be inserted. However, these new possibilities do not add any new corner cases.

More events or resources. More events or resources lead to more queues in the system, but do not change the capacity of the queues. Thus, these parameters do not affect queue operations w.r.t. verification.

Higher limit for counting resources. When a task T (that already owns a resource R) requests R once again, R's internal counter is incremented. Still, a higher limit does not create new corner cases w.r.t. verification.

Composition Scenarios. Bigger system configurations do not create new scenarios, but only new settings for the existing ones, e.g., starvation of different tasks, or deadlocks involving different sets of tasks and resources.

# 6 Related Work

Similar to our approach, and with the goal of verifying compositional requirements, Ironclad [18] covers the full software stack. It uses Dafny [25] and Boogie [6] to verify assembly code, but it addresses only security requirements. Borda et al. [8] propose a language to model self-adaptive cyber-physical systems modularly, and a technique to support compositional verification; however, timing requirements are not addressed. Giese et al. [12] address compositional verification of real-time systems modeled in UML. Components are verified in isolation, and the correctness of the system is derived by ensuring that the composition is syntactically correct; however, this is only possible if the components do not share resources. Uppaal has been used for schedulability analysis of compositional avionic software [17], and for conformance testing with requirements specified as pre- and post-condition functions [29].

Regarding modeling and verification of OSes, on a more abstract level, Alkhammash et al. [5] propose guidelines for modeling FreeRTOS [1] using Event-B [3]. Cheng et al. formally specify the behavior of FreeRTOS tasks [11] and verify it using the Z/Eves theorem prover [26], but, unlike our approach, they do not address timing, resource sharing, or interrupts.

On a less abstract level, closer to the real implementation, seL4 [20] proves the functional correctness of the C code of the kernel. Furthermore, it guarantees that the binary code correctly reflects the semantics of the C code. Hyperkernel [27] formally verifies the functional correctness of syscalls, exceptions, and interrupts. The verification is performed at the LLVM intermediate representation level [32] using the Z3 SMT solver [9]. CertikOS [16] is the first work that formally verifies a concurrent OS kernel. They use the Coq proof assistant [2], a C-like programming language, and a verified compiler [15]. These approaches focus exclusively on the functional correctness of the OS kernel.

We have not found a work that can verify timing, resource sharing, task synchronization, and interrupts in a compositional context. That is what our work enables, after proving the correctness of the OS model.

### 7 Conclusions and Future Work

In this paper, we presented a Uppaal modeling approach for verifying compositional software, exemplified with an OS model containing a common set of features present in modern RTOSes. Since the proposed techniques and patterns are general, they can be used to model any concrete OS. We showed how to model the OS aiming to simplify the modeling of application tasks (Section 3). We also introduced separate OS requirements and composition requirements, and showed how they can be formally specified (Section 4) to decouple the verification of the OS and the application layer. We then proposed the concept of abstract tasks (Section 4.3) and reasoned that the OS model can be fully verified with a minimal set of such tasks, which interact through OS primitives (e.g., events and shared resources) and thus trigger all OS functions in all possible scenarios (Section 4.4). Finally, we evaluated the resource consumption of the verification process, reasoned about the sufficiency of the used minimal configuration, and analyzed the benefits of the proposed concept (Section 5).

With the OS model proven correct, there is no need to re-verify it when the upper layers are modified, which saves time and resources in the verification of concrete task sets. We consider this particularly beneficial for developing and maintaining highly dependable systems, where, e.g., the task composition and functionality may change during updates. Another benefit of our approach is its potential use for test case generation for the application software.

This work opens a variety of directions for future work. We are currently working on task slices to verify further (non-)functional requirements. Besides, we continuously improve the model design for a better trade-off between abstraction level and verification overhead, including the avoidance of potential state space explosions. Tools to convert between source code and Uppaal templates shall reduce the modeling gap, i.e., the discrepancy between the formal model and the actual implementation. While our models allow the verification of applications on top of an OS, a limitation is that model correctness does not yet imply implementation correctness. For that, the full path from models to machine code must be verified.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Compositional Automata Learning of Synchronous Systems

Thomas Neele<sup>1</sup> and Matteo Sammartino<sup>2,3</sup>

<sup>1</sup> Eindhoven University of Technology, Eindhoven, The Netherlands t.s.neele@tue.nl
<sup>2</sup> Royal Holloway University of London, Egham, UK
<sup>3</sup> University College London, London, UK
matteo.sammartino@rhul.ac.uk

Abstract. Automata learning is a technique to infer an automaton model of a black-box system via queries to the system. In recent years it has found widespread use both in industry and academia, as it enables formal verification when no model is available or it is too complex to create one manually. In this paper we consider the problem of learning the individual components of a black-box synchronous system, assuming we can only query the whole system. We introduce a compositional learning approach in which several learners cooperate, each aiming to learn one of the components. Our experiments show that, in many cases, our approach requires significantly fewer queries than a widely-used non-compositional algorithm such as L<sup>∗</sup>.

# 1 Introduction

Automata learning is a technique for inferring an automaton from a black-box system by interacting with it and observing its responses. It can be seen as a game in which a learner poses queries to a teacher – an abstraction of the target system – with the intent of inferring a model of the system. The learner can ask two types of queries: a membership query, asking if a given sequence of actions is allowed in the system; and an equivalence query, asking if a given model is correct. The teacher must provide a counter-example in case the model is incorrect. In practice, membership queries are implemented as tests on the system, and equivalence queries as conformance test suites.

The original algorithm L<sup>∗</sup> proposed by Dana Angluin in 1987 [3] allowed learning DFAs; since then it has been extended to a variety of richer automata models, including symbolic [5] and register [7,26] automata, automata for ω-regular languages [4], and automata with fork-join parallelism [18], to mention recent work. Automata learning enables formal verification when no formal model is available, and also reverse engineering of various systems. It has found wide application in both academia and industry. Examples are: the verification of neural networks [31], finding bugs in specific implementations of security [29,12] and network protocols [11], or refactoring legacy software [30].

In this paper we consider the case when the system to be learned consists of several concurrent components that interact in a synchronous way; the components themselves are not accessible, but their number and respective input alphabets are known. It is well-known that the composite state-space can grow exponentially with the number of components. If we use L<sup>∗</sup> to learn such a system as a whole, it will take a number of queries that is proportional to the whole state-space, many more than if we were able to apply L<sup>∗</sup> to the individual components. Since in practice queries are implemented as tests performed on the system (in the case of equivalence queries, exponentially many tests are required), learning the whole system may be impractical if tests take a non-negligible amount of time, e.g., if each test needs to be repeated to ensure accuracy of results or when each test requires physical interaction with a system.

In this work we introduce a compositional approach that is capable of learning models for the individual components, by interacting with an ordinary teacher for the whole system. This is achieved by translating queries on a single component to queries on the whole system, and interpreting their results at the level of a single component. The fundamental challenge is that components are not independent: they interact synchronously, meaning that sequences of actions in the composite system are realised by the individual components performing their actions in a certain relative order. The implications are that: (i) the answer to some membership queries for a specific component may be unknown if the correct sequence of interactions with other components has not yet been discovered; and (ii) counter-examples for the global system cannot univocally be decomposed into counter-examples for individual components, therefore some of them may result in spurious counter-examples that need to be corrected later.

To tackle these issues, we make the following contributions:


The rest of this paper is structured as follows. We introduce preliminary concepts and notation in Section 2. Our learning framework is presented in Section 3. Section 4 discusses the details of our implementation and the results of our experiments. Related work is highlighted in Section 5 and Section 6 concludes.

### 2 Preliminaries

Notation and terminology. We use Σ to denote a finite alphabet of action symbols, and Σ<sup>∗</sup> to denote the set of finite sequences of symbols in Σ, which we call traces; we use ϵ to denote the empty trace. Given two traces s<sub>1</sub>, s<sub>2</sub> ∈ Σ<sup>∗</sup>, we denote their concatenation by s<sub>1</sub> · s<sub>2</sub>; for two sets S<sub>1</sub>, S<sub>2</sub> ⊆ Σ<sup>∗</sup>, S<sub>1</sub> · S<sub>2</sub> denotes element-wise concatenation. Given s ∈ Σ<sup>∗</sup>, we denote by Pref(s) the set of prefixes of s, and by Suf(s) the set of its suffixes; the notation lifts to sets S ⊆ Σ<sup>∗</sup> as expected. We say that S ⊆ Σ<sup>∗</sup> is prefix-closed (resp. suffix-closed) whenever S = Pref(S) (resp. S = Suf(S)). The projection σ↾<sub>Σ′</sub> of σ on an alphabet Σ′ ⊆ Σ is the sequence of symbols in σ that are also contained in Σ′. Finally, given a set S, we write |S| for its cardinality.
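For illustration, these trace operations can be sketched in Python, representing traces as tuples of action symbols (the helper names are ours):

```python
# Minimal sketches of projection, Pref and Suf, with traces as tuples.
def project(trace, alphabet):
    """sigma restricted to alphabet: keep only symbols occurring in it."""
    return tuple(a for a in trace if a in alphabet)

def prefixes(trace):
    """Pref(s): all prefixes of a trace, including () and s itself."""
    return {trace[:k] for k in range(len(trace) + 1)}

def suffixes(trace):
    """Suf(s): all suffixes of a trace, including () and s itself."""
    return {trace[k:] for k in range(len(trace) + 1)}

sigma = ("a", "b", "c", "b")
print(project(sigma, {"b", "c"}))  # ('b', 'c', 'b')
print(prefixes(("a", "b")))        # the prefixes (), ('a',) and ('a', 'b')
```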

In this work we represent the state-based behaviour of a system as a labelled transition system.

Definition 1 (Labelled Transition System). A labelled transition system (LTS) is a four-tuple L = (S, →, ŝ, Σ), where
– S is a set of states;
– → ⊆ S × Σ × S is the transition relation;
– ŝ ∈ S is the initial state; and
– Σ is the alphabet.


We say that L is deterministic whenever for each s ∈ S, a ∈ Σ there is at most one transition from s labelled by a.

Some actions in Σ may not be allowed from a given state. We say that an action a is enabled in s, written $s \xrightarrow{a}$, if there is t such that $s \xrightarrow{a} t$. This notation is also extended to traces σ ∈ Σ<sup>∗</sup>, yielding $s \xrightarrow{\sigma} t$ and $s \xrightarrow{\sigma}$. The language of L is the set of traces enabled from the starting state, formally:

$$\mathcal{L}(L) = \{ \sigma \in \Sigma^* \mid \hat{s} \xrightarrow{\sigma} \}.$$

From here on, we only consider deterministic LTSs. Note that this does not reduce the expressivity, in terms of the languages that can be encoded.

Remark 1. Languages of LTSs are always prefix-closed, because every prefix of an enabled trace is necessarily enabled. Prefix-closed languages are accepted by a special class of deterministic finite automata (DFA), where all states are final except for a sink state, from which all transitions are self-loops. Our implementation (see Section 4) uses these models as the underlying representation of LTSs.
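A quick sketch of the prefix-closedness condition for a finite set of traces (the helper name is ours):

```python
# A finite language (set of tuple-traces) is prefix-closed iff it contains
# every prefix of each of its traces.
def is_prefix_closed(language):
    return all(trace[:k] in language
               for trace in language
               for k in range(len(trace)))

# The language of Example 2 restricted to length <= 3: a and b alternate.
L = {(), ("a",), ("a", "b"), ("a", "b", "a")}
print(is_prefix_closed(L))             # True
print(is_prefix_closed(L - {("a",)}))  # False: the prefix ('a',) is missing
```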

We now introduce a notion of parallel composition of LTSs, which must synchronise on shared actions.

Definition 2. Given n LTSs L<sub>i</sub> = (S<sub>i</sub>, →<sub>i</sub>, ŝ<sub>i</sub>, Σ<sub>i</sub>) for 1 ≤ i ≤ n, their parallel composition, notation ∥<sup>n</sup><sub>i=1</sub> L<sub>i</sub>, is an LTS over the alphabet ⋃<sup>n</sup><sub>i=1</sub> Σ<sub>i</sub>, defined as follows:

– the set of states is S<sub>1</sub> × · · · × S<sub>n</sub>;
– the transition relation is given by the rule

$$\frac{s_i \xrightarrow{a}_i t_i \ \text{ for all } i \text{ such that } a \in \Sigma_i \qquad s_j = t_j \ \text{ for all } j \text{ such that } a \notin \Sigma_j}{(s_1, \dots, s_n) \xrightarrow{a} (t_1, \dots, t_n)}$$

– the initial state is (ŝ<sub>1</sub>, . . . , ŝ<sub>n</sub>).

Intuitively, a certain action a can be performed from (s<sub>1</sub>, . . . , s<sub>n</sub>) only if it can be performed by all component LTSs that have a in their alphabet; all other LTSs must stay idle. We say that an action a is local if there is exactly one i such that a ∈ Σ<sub>i</sub>; otherwise it is called synchronising. The parallel composition of LTSs thus forces individual LTSs to cooperate on synchronising actions; local actions can be performed independently. We typically refer to the LTSs that make up a composite LTS as components. Synchronisation of components corresponds to communication between components in real-world settings.

Example 1. Consider the left two LTSs below with the respective alphabets {a, c} and {b, c}. Their parallel composition is depicted on the right.

Here a and b are local actions, whereas c is synchronising. Note that, despite L<sub>1</sub> being able to perform c from its initial state s<sub>0</sub>, there is no c transition from (s<sub>0</sub>, t<sub>0</sub>), because c is not initially enabled in L<sub>2</sub>. First L<sub>2</sub> will have to perform b to reach t<sub>1</sub>, where c is enabled, which will allow L<sub>1</sub> ∥ L<sub>2</sub> to perform c. ⊓⊔
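Under an assumed encoding of deterministic LTSs as (transitions, initial state, alphabet) triples, Definition 2 can be sketched in Python; the two component LTSs below are our reconstruction of Example 1 from its description:

```python
# Sketch of the synchronous product of Definition 2 (encoding is ours).
# Each LTS is (delta, initial, alphabet), with delta a dict
# mapping (state, action) -> state, i.e. a deterministic LTS.
def parallel_composition(ltss):
    alphabet = set().union(*(sigma for _, _, sigma in ltss))
    initial = tuple(init for _, init, _ in ltss)
    trans, frontier, seen = {}, [initial], {initial}
    while frontier:
        state = frontier.pop()
        for a in alphabet:
            target = []
            for (delta, _, sigma), s in zip(ltss, state):
                if a not in sigma:
                    target.append(s)              # component stays idle
                elif (s, a) in delta:
                    target.append(delta[(s, a)])  # component moves
                else:
                    break                         # a not enabled here
            else:  # a is enabled in every component that knows it
                t = tuple(target)
                trans[(state, a)] = t
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return trans, initial, alphabet

# Assumed shapes for Example 1: c is synchronising, initially disabled in L2.
L1 = ({("s0", "a"): "s0", ("s0", "c"): "s1"}, "s0", {"a", "c"})
L2 = ({("t0", "b"): "t1", ("t1", "c"): "t0"}, "t0", {"b", "c"})
trans, init, _ = parallel_composition([L1, L2])
print((init, "c") in trans)  # False: L2 must first perform b
print((init, "b") in trans)  # True
```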

We sometimes also apply parallel composition to sets of traces: ∥<sub>i</sub> S<sub>i</sub> is equivalent to ∥<sub>i</sub> T<sub>i</sub>, where each T<sub>i</sub> is a tree-shaped LTS that accepts exactly S<sub>i</sub>, i.e., L(T<sub>i</sub>) = S<sub>i</sub>. In such cases, we will explicitly mention the alphabet each T<sub>i</sub> is assigned. This notation furthermore applies to single traces: ∥<sub>i</sub> σ<sub>i</sub> = ∥<sub>i</sub> {σ<sub>i</sub>}.

#### 2.1 The L<sup>∗</sup> algorithm

We now recall the basic L<sup>∗</sup> algorithm. Although the algorithm targets DFAs, we will present it in terms of deterministic LTSs, which we use in this paper (these are a sub-class of DFAs, see Remark 1). The algorithm can be seen as a game in which a learner poses queries to a teacher about a target language L that only the teacher knows. The goal of the learner is to learn a minimal deterministic LTS with language L. In practical scenarios, the teacher is an abstraction of the target system we wish to learn a model of. The learner can ask two types of queries:
– a membership query, asking whether a given trace σ ∈ Σ<sup>∗</sup> is in L; and
– an equivalence query, asking whether a hypothesis LTS H satisfies L(H) = L; if not, the teacher must provide a counter-example.


Fig. 1: A closed and consistent observation table and the LTS that can be constructed from it.


The learner organises the information received in response to queries in an observation table, which is a triple (S, E, T), consisting of a finite, prefix-closed set S ⊆ Σ<sup>∗</sup>, a finite, suffix-closed set E ⊆ Σ<sup>∗</sup>, and a function T : (S ∪ S · Σ) · E → {0, 1}. The function T can be seen as a table in which rows are labelled by traces in S ∪ S · Σ, columns by traces in E, and cell T(s · e) contains 1 if s · e ∈ L and 0 otherwise.

Example 2. Consider the prefix-closed language L over the alphabet Σ = {a, b} consisting of traces where a and b alternate, starting with a; for instance aba ∈ L but abb ∉ L. An observation table generated by a run of L<sup>∗</sup> targeting this language is shown in Figure 1a. ⊓⊔

Let row<sub>T</sub> : S ∪ S · Σ → (E → {0, 1}) denote the function row<sub>T</sub>(s)(e) = T(s · e) mapping each row of T to its content (we omit the subscript T when clear from the context). The crucial observation is that T approximates the Nerode congruence [28] for L as follows: s<sub>1</sub> and s<sub>2</sub> are in the same congruence class only if row(s<sub>1</sub>) = row(s<sub>2</sub>), for s<sub>1</sub>, s<sub>2</sub> ∈ S. Based on this fact, the learner can construct a hypothesis LTS from the table, in the same way the minimal DFA accepting a given language is built via its Nerode congruence:<sup>3</sup>

– the set of states is {row(s) | s ∈ S, row(s)(ϵ) = 1};
– the initial state is row(ϵ);
– the transition relation contains a transition row(s) $\xrightarrow{a}$ row(s · a) for each s ∈ S and a ∈ Σ with row(s · a)(ϵ) = 1.

<sup>3</sup> For the minimal DFA, the set of states is {row(s) | s ∈ S}; here we only take accepting states as we are building an LTS.


In order for the transition relation to be well-defined, the table has to satisfy the following conditions:
– closed: for every s ∈ S and a ∈ Σ, there is some s′ ∈ S such that row(s · a) = row(s′);
– consistent: whenever s<sub>1</sub>, s<sub>2</sub> ∈ S satisfy row(s<sub>1</sub>) = row(s<sub>2</sub>), then row(s<sub>1</sub> · a) = row(s<sub>2</sub> · a) for all a ∈ Σ.


Example 3. The table of Example 2 is closed and consistent. The corresponding hypothesis LTS, which is also the minimal LTS accepting L, is shown in Figure 1b. ⊓⊔

The algorithm works in an iterative fashion: starting from the empty table, where S and E only contain ϵ, the learner extends the table via membership queries until it is closed and consistent, at which point it builds a hypothesis and submits it to the teacher in an equivalence query. If a counter-example is received, it is incorporated in the observation table by adding its prefixes to S, and the updated table is again checked for closedness and consistency. The algorithm is guaranteed to eventually produce a hypothesis H such that L(H) = L, for which an equivalence query will be answered positively, causing the algorithm to terminate.
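The observation table and its two conditions can be sketched in Python (the representation is ours; the membership oracle is answered here from the known target language of Example 2 instead of real queries). Note that this deliberately small table is closed but not yet consistent: ϵ and a have equal rows yet disagree after appending a, which is exactly the kind of defect that prompts L∗ to extend E:

```python
# Sketch of an observation table (S, E, T); traces are tuples.
def make_table(S, E, alphabet, member):
    rows = set(S) | {s + (a,) for s in S for a in alphabet}
    return {r: tuple(member(r + e) for e in E) for r in rows}

def is_closed(T, S, alphabet):
    # every extended row S·Σ must equal the row of some s in S
    top = {T[s] for s in S}
    return all(T[s + (a,)] in top for s in S for a in alphabet)

def is_consistent(T, S, alphabet):
    # traces with equal rows must keep equal rows after any one action
    return all(T[s1 + (a,)] == T[s2 + (a,)]
               for s1 in S for s2 in S if T[s1] == T[s2]
               for a in alphabet)

# Membership oracle for Example 2: a and b alternate, starting with a.
def member(sigma):
    return all(x == ("a" if i % 2 == 0 else "b") for i, x in enumerate(sigma))

S, E, alphabet = {(), ("a",), ("b",)}, {()}, {"a", "b"}
T = make_table(S, E, alphabet, member)
print(is_closed(T, S, alphabet), is_consistent(T, S, alphabet))  # True False
```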

### 3 Learning Synchronous Components Compositionally

In this section, we show how to compositionally learn an unknown system M = M<sub>1</sub> ∥ · · · ∥ M<sub>n</sub> consisting of n parallel LTSs. To achieve this, we assume that we are given: (i) a teacher for M; and (ii) the respective alphabets Σ<sub>1</sub>, . . . , Σ<sub>n</sub> of M<sub>1</sub>, . . . , M<sub>n</sub>. We propose the architecture in Figure 2. We have n learners, which are instances of (an extension of) the L<sup>∗</sup> algorithm, one for each component M<sub>i</sub>. The instance L<sup>∗</sup><sub>i</sub> can pose queries for M<sub>i</sub> to an adapter, which converts them to queries on M. The resulting yes/no answer (and possibly counter-example) is translated back to information about M<sub>i</sub>, which is returned to learner L<sup>∗</sup><sub>i</sub>. The adapter moreover choreographs the learners to some extent: before an equivalence query H ?= M can be sent to the teacher, the adapter must first receive equivalence queries H<sub>i</sub> ?= M<sub>i</sub> from each learner.

We first discuss the implementation of the adapter and show its limitations. To deal with these limitations, we next propose a couple of extensions to L<sup>∗</sup> (Section 3.2). Completeness claims are stated in Section 3.3. Several optimisations are discussed in Section 3.4.

Fig. 2: Architecture for learning LTS M consisting of components M<sub>1</sub> ∥ · · · ∥ M<sub>n</sub>.

Fig. 3: Running example consisting of two LTSs L<sub>1</sub> and L<sub>2</sub> and their parallel composition L. The respective alphabets are {a, c}, {b, c} and {a, b, c}.

#### 3.1 Query Adapter

As sketched above, our adapter answers queries on each of the LTSs M<sup>i</sup> , based on information obtained from queries on M. However, the application of the parallel operator causes loss of information, as the following example illustrates. We will use the LTSs below as a running example throughout this section.

Example 4. Consider the LTSs L<sub>1</sub>, L<sub>2</sub> and L = L<sub>1</sub> ∥ L<sub>2</sub> depicted in Figure 3. Their alphabets are {a, c}, {b, c} and {a, b, c}, respectively.

Suppose we send a membership query bc to the teacher and we receive as answer that bc ∉ L(L). At this point, we do not have sufficient information to deduce about the respective projections whether bc↾<sub>{a,c}</sub> = c ∉ L(L<sub>1</sub>) or bc↾<sub>{b,c}</sub> = bc ∉ L(L<sub>2</sub>) (or both). In this case, only the latter holds. Similarly, if a composite hypothesis H = H<sub>1</sub> ∥ H<sub>2</sub> is rejected with a negative counter-example ccc ∉ L(L), we cannot deduce whether this is because ccc ∉ L(L<sub>1</sub>) or ccc ∉ L(L<sub>2</sub>) (or both). Here, however, the former is true but the latter is not, i.e., ccc is not a counter-example for H<sub>2</sub> at all. ⊓⊔

Generally, given negative information on the composite level (σ /∈ L(M)), it is hard to infer information for a single component M<sup>i</sup> , whereas positive information (σ ∈ L(M)) easily translates back to the level of individual components.

We thus need to relax the guarantees on the answers given by the adapter in the following way:


The procedures that implement the adapter are stated in Listing 1. For each 1 ≤ i ≤ n, we have one instance of each of the functions Member<sub>i</sub> and Equiv<sub>i</sub>, used by the i-th learner to pose its queries. Here, we assume that for each component i, a copy of the latest hypothesis H<sub>i</sub> is stored, as well as a set P<sub>i</sub> which contains traces that are certainly in L(M<sub>i</sub>). Membership and equivalence queries on M will be forwarded to the teacher via the functions Member(σ) and Equiv(H), respectively.

Membership Queries. A membership query σ ∈ L(M<sub>i</sub>) can be answered directly by posing σ ∈ L(M) to the teacher if σ contains only actions local to M<sub>i</sub>. However, in the case where σ contains synchronising actions, cooperation from other components M<sub>j</sub> is required. So, during the runtime of the program, for each i we collect traces in a set P<sub>i</sub>, for which it is certain that P<sub>i</sub> ⊆ L(M<sub>i</sub>). That is, P<sub>i</sub> contains traces which were returned as positive counter-examples (line 16) or membership queries (line 5). Recall from Section 2 that we can construct tree-LTSs to compute ∥<sub>j≠i</sub> P<sub>j</sub>, where each P<sub>j</sub> has alphabet Σ<sub>j</sub>. By construction, we have L(∥<sub>j≠i</sub> P<sub>j</sub>) ⊆ L(∥<sub>j≠i</sub> M<sub>j</sub>), and so we have an under-approximation of the behaviour of the other components, possibly including some synchronising actions they can perform. If we find in L(∥<sub>j≠i</sub> P<sub>j</sub>) a trace σ′ such that σ and σ′ contain the same sequence of synchronising actions (line 2, stored in set Π), we construct an arbitrary interleaving (respecting synchronising actions) of σ and σ′ and forward it to the teacher (line 4). Such an interleaving is a trace σ<sub>int</sub> ∈ L(σ ∥ σ′) of maximal length. Note that a σ′ ∈ Π trivially exists if σ does not contain synchronising actions. If, on the other hand, no such σ′ exists, we do not have sufficient information on how the other LTSs M<sub>j</sub> can cooperate, and we return 'unknown' (line 7).

Example 5. Refer to the running example in Figure 3. Suppose that the current knowledge about L<sup>2</sup> is P<sup>2</sup> = {ϵ, b}. When Member<sup>1</sup>(c) is called, Π = ∅, because there is no trace σ′ ∈ P<sup>2</sup> that is equal to c when restricted to {a, c}, and therefore unknown is returned. Intuitively, since the second learner has not yet discovered that c or bc (or some other trace containing a c) is in its language, the adapter is unable to turn the query c on L<sup>1</sup> into a query for the composite system. ⊓⊔

Example 6. Suppose now that cac ∈ P<sup>1</sup>, i.e., we already learned that cac ∈ L(L<sup>1</sup>). When posing the membership query cbc ∈ L(L<sup>2</sup>), the adapter finds that cac and cbc contain the same synchronising actions (viz. cc) and constructs an interleaving, for example cabc. The teacher answers negatively to the query cabc ∈ L(L), and thus we learn that cbc ∉ L(L<sup>2</sup>). ⊓⊔

Listing 1: Membership and equivalence query procedures for component i.

```
Input: Alphabets Σ1, . . . , Σn of the components
Data: for each i, the latest hypothesis Hi and a set Pi of traces, initially {ϵ}
 1 Function Memberi(σ)
 2     Π := {σ′ ∈ L(∥j≠i Pj) | σ′↾Σi = σ↾Σother} where Σother = ∪j≠i Σj;
 3     if Π ≠ ∅ then
 4         answer := Member(σint) for some σ′ ∈ Π and some maximal
               σint ∈ L(σ ∥ σ′);                /* construct interleaving */
 5         if answer = yes then Pi := Pi ∪ {σ};
 6         return answer
 7     else return unknown;
 8 Function Equivi(H′)
 9     Hi := H′;
10     while true do
11         barrier(n);        /* wait until this point is reached for every i */
12         construct H = ∥i Hi;
13         switch Equiv(H) do
14             case yes do return yes;
15             case (no, σ) do
16                 if σ ∉ L(H) then Pi := Pi ∪ {σ↾Σi};
17                 if a ∈ Σi, where σ = σ′ · a, and σ ∈ L(H) ⇔ σ↾Σi ∈ L(Hi) then
18                     return (no, σ↾Σi)
```

Equivalence Queries. For equivalence queries, the adapter offers functions Equiv<sup>i</sup>. To construct a corresponding query on the composite level, we first need to gather a hypothesis H<sup>i</sup> for each i. Thus, we synchronise all learners in a barrier (line 11), after which a composite hypothesis can be constructed and forwarded to the teacher (lines 12, 13). An affirmative answer can be returned directly, while in the negative case we investigate the returned counter-example σ. If σ is a positive counter-example, we add its projection to P<sup>i</sup> (line 16). By the assumption that σ is shortest<sup>4</sup>, H and M agree on all σ′ ∈ Pref(σ)\{σ}. Thus, σ only concerns H<sup>i</sup> if the last action in σ is contained in Σ<sup>i</sup>. Furthermore, we need to check whether H and H<sup>i</sup> agree on σ: it can happen that σ↾Σ<sup>i</sup> ∈ L(H<sup>i</sup>) but σ ∉ L(H) due to other hypotheses not providing the necessary communication opportunities. If both conditions are satisfied (line 17), we return the projection of σ on Σ<sup>i</sup> (line 18). Otherwise, we cannot conclude anything about H<sup>i</sup> at this moment and we iterate (line 10). In that case, we effectively wait for other hypotheses H<sup>j</sup>, with j ≠ i, to be updated before trying again. A termination argument is provided later in this section.
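The per-component validity check on a counter-example (lines 17-18 of Listing 1) can be sketched as follows; `in_H` and `in_Hi` are hypothetical membership predicates for L(H) and L(H<sup>i</sup>), and sigma is assumed to be a shortest counter-example for the composite hypothesis:

```python
def relevant_for_i(sigma, alphabet_i, in_H, in_Hi):
    """Decide whether a shortest counter-example sigma for the composite
    hypothesis H is valid for component i (Listing 1, lines 17-18).
    Returns the projected counter-example, or None (iterate and wait
    for the other hypotheses to change)."""
    proj = tuple(a for a in sigma if a in alphabet_i)   # σ↾Σi
    if sigma[-1] not in alphabet_i:
        return None            # last action not in Σi: σ does not concern Hi
    if in_H(sigma) != in_Hi(proj):
        return None            # H and Hi disagree on σ: cannot conclude anything
    return ('no', proj)
```

On Example 7 below, the counter-example cc is returned only to the first learner: for the second learner, cc↾Σ<sup>2</sup> is accepted by H<sup>2</sup> while cc is rejected by H, so the predicates disagree and the check fails.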

<sup>4</sup> This assumption can be satisfied in practice by using a lexicographical ordering on the conformance test suite the teacher generates to decide equivalence.

Example 7. Again considering our running example (Figure 3), suppose the two learners call in parallel the functions Equiv <sup>1</sup> (H1) and Equiv <sup>2</sup> (H2). The provided hypotheses and their parallel composition are as follows:

[Diagram: the LTSs H<sup>1</sup> (transitions labelled c and a), H<sup>2</sup> (transitions labelled b and c), and their composition H<sup>1</sup> ∥ H<sup>2</sup>.]

The adapter forwards H = H<sup>1</sup> ∥ H<sup>2</sup> to the teacher, which returns the counter-example cc. The last symbol, c, occurs in both alphabets, but cc ∈ L(H) does not hold and cc↾Σ<sup>2</sup> ∈ L(H<sup>2</sup>) does, so only Equiv<sup>1</sup>(H<sup>1</sup>) returns (no, cc). The call to Equiv<sup>2</sup>(H<sup>2</sup>) hangs in the while loop of line 10 until Equiv<sup>1</sup> is invoked with a different hypothesis. ⊓⊔

Example 8. Suppose now that the hypotheses and their composition are:

[Diagram: the hypotheses H<sup>1</sup>, H<sup>2</sup> and their composition H<sup>1</sup> ∥ H<sup>2</sup>.]

When we submit Equiv(H<sup>1</sup> ∥ H<sup>2</sup>), we may receive the negative counter-example ccc, which is a shortest counter-example. This counter-example does not contain any information to suggest that it only applies to H<sup>1</sup>. It is a spurious counter-example for H<sup>2</sup>, since the language of the second component does contain the trace ccc. ⊓⊔

#### 3.2 L<sup>∗</sup> Extensions

As explained in the previous section, the capabilities of our adapter are limited compared to those of an ordinary teacher. We thus extend L<sup>∗</sup> to deal with the answer 'unknown' to membership queries and with spurious counter-examples.

Answer 'unknown'. The setting of receiving incomplete information through membership queries first occurred in [15], and is also discussed in [24]. Here we briefly recall the ideas of [15]. To deal with partial information from membership queries, the concept of an observation table is generalised such that the function T : (S ∪ S · Σ) · E → {0, 1} is a partial function, that is, for some cells we have no information. Based on T, we now define the function row : S ∪ S · Σ → E → {0, 1, ?} to fill the cells of the table: row<sub>T</sub>(s)(e) = T(se) if T(se) is defined, and ? otherwise. We refer to '?' as a wildcard; its actual value is currently unknown and might be learned at a later time or never at all. To deal with the uncertain nature of wildcards, we introduce a relation ≈ on rows, where row(s<sub>1</sub>) ≈ row(s<sub>2</sub>) if for every e ∈ E, row(s<sub>1</sub>)(e) ≠ row(s<sub>2</sub>)(e) implies that row(s<sub>1</sub>)(e) = ? or row(s<sub>2</sub>)(e) = ?. Note that ≈ is not an equivalence relation, since it is not transitive. Closedness and consistency are defined as before, but now use the new relation ≈. We say an LTS M is consistent with T if for all s ∈ Σ<sup>∗</sup> such that T(s) is defined, we have T(s) = 1 if and only if s ∈ L(M).
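As a small sketch (our own modelling, with rows as dicts from suffixes in E to 0, 1, or None for the wildcard '?'), the relation ≈ can be written as:

```python
def compatible(row1, row2):
    """row1 ≈ row2: the rows agree on every suffix where both are defined.
    A wildcard (None) is compatible with anything."""
    return all(row1[e] is None or row2[e] is None or row1[e] == row2[e]
               for e in row1)
```

The non-transitivity is easy to see here: a fully-wildcard row is compatible with two rows that are incompatible with each other.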

As discussed earlier, Angluin's original L<sup>∗</sup> algorithm relies on the fact that, for a closed and consistent table, there is a unique minimal DFA (or, in our case, LTS) that is consistent with T. However, the occurrence of wildcards in the observation table may allow multiple minimal LTSs that are consistent with T. Such a minimal consistent LTS can be obtained with a SAT solver, as described in [19].

Similar to Angluin's original algorithm, this extension comes with some correctness theorems. First of all, it terminates, outputting the minimal LTS for the target language. Furthermore, each hypothesis is consistent with all membership queries and counter-examples that were provided so far. Lastly, each subsequent hypothesis has at least as many states as the previous one, but never more than the minimal LTS for the target language.

Spurious Counter-Examples. We now extend this algorithm with the ability to deal with spurious counter-examples. Any negative counter-example σ ∈ L(H<sup>i</sup>) might be spurious, i.e., it is actually the case that σ ∈ L(M<sup>i</sup>). Since L<sup>∗</sup> excludes σ from the language of all subsequent hypotheses, we might later get the same trace σ, but now as a positive counter-example. In that case, the initial negative judgment from the equivalence teacher was spurious.

One possible way of dealing with spurious counter-examples is to add to L<sup>∗</sup> the ability to overwrite entries in the observation table when a spurious counter-example is corrected. However, this may cause the learner to diverge if infinitely many spurious counter-examples are returned. Therefore, we instead choose to add a backtracking mechanism to ensure our search converges. The pseudocode is listed in Listing 2; we refer to this as L<sup>∗</sup><sub>?,b</sub> (L<sup>∗</sup> with wildcards and backtracking).

We have a mapping BT that stores backtracking points; BT is initialised to the empty mapping (line 1). Lines 5-11 ensure the observation table is closed and consistent in the same way as L<sup>∗</sup>, but using the relation ≈ on rows instead. Next, we construct a minimal hypothesis that is consistent with the observations in T (line 12). This hypothesis is posed as an equivalence query. If the teacher replies with a counter-example σ for which T(σ) = 0, then σ was a spurious counter-example, so we backtrack and restore the observation table from just before T(σ) was introduced (line 15). Otherwise, we store a backtracking point for when σ later turns out to be spurious (line 17); this is only necessary if σ is a negative counter-example. Note that not all information is lost when backtracking: the set P<sup>i</sup> stored in the adapter is unaffected, so some positive traces are carried over after backtracking. Finally, we incorporate σ into the observation table (line 18). When the teacher accepts our hypothesis, we terminate.
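The save/restore bookkeeping around BT can be sketched as follows (a simplified model of our own: the observation-table state is a triple (S, E, T) of a prefix set, a suffix set, and a partial table, and `in_H` is a hypothetical membership predicate for the current hypothesis):

```python
import copy

def process_cex(state, BT, sigma, in_H):
    """Sketch of the backtracking logic of Listing 2 (lines 13-18).
    state = (S, E, T); BT maps counter-examples to saved table snapshots.
    Returns the state to continue from."""
    S, E, T = state
    if T.get(sigma) == 0:
        # sigma was earlier recorded as negative: that judgment was spurious,
        # so restore the table from just before it was incorporated (line 15)
        return BT[sigma]
    if in_H(sigma):
        # negative counter-example: it might be spurious, so save a
        # backtracking point first (line 17)
        BT[sigma] = copy.deepcopy(state)
    # incorporate sigma and all its prefixes into S (line 18; the
    # membership queries that fill T are omitted here)
    for k in range(len(sigma) + 1):
        S.add(sigma[:k])
    return (S, E, T)
```

Because the snapshot is taken before σ is incorporated, restoring it removes σ and everything learned from it, while the adapter's set P<sup>i</sup> survives untouched.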

We finish this section with an example that shows how spurious counter-examples may be resolved.

Listing 2: Learning with wildcards and backtracking.

```
 1 Set BT to ∅;
 2 Initialise S and E to {ϵ};
 3 Extend T to S ∪ S · Σi by calling Memberi;
 4 repeat
 5     while (S, E, T) is not closed and consistent do
 6         if (S, E, T) is not consistent then
 7             Find s1, s2 ∈ S, a ∈ Σi, e ∈ E such that rowT(s1) ≈ rowT(s2)
                   and T(s1 · a · e) ̸≈ T(s2 · a · e);
 8             Add a · e to E and extend T by calling Memberi;
 9         if (S, E, T) is not closed then
10             Find s1 ∈ S, a ∈ Σi such that rowT(s1 · a) ̸≈ rowT(s) for all s ∈ S;
11             Add s1 · a to S and extend T by calling Memberi;
12     Call Equivi(H) for some minimal LTS H consistent with T;
13     if Teacher replies with counter-example σ then
14         if T(σ) = 0 then           /* σ corrects an earlier spurious CEX */
15             (S, E, T) := BT(σ);
16         else if σ ∈ L(H) then      /* σ might be spurious */
17             BT(σ) := (S, E, T);
18         Add σ and all its prefixes to S and extend T by calling Memberi;
19 until Teacher replies yes to conjecture H;
20 return H;
```
Example 9. Refer again to the LTSs of our running example in Figure 3. Consider the situation after proposing the hypotheses of Example 8 and receiving the counter-example ccc, which is spurious for the second learner.

In the next iteration, Member<sup>2</sup> can answer some membership queries, such as cbc, that are necessary to expand the table of the second learner. This is enabled by the fact that P<sup>1</sup> contains cc from the positive counter-example of Example 7 (line 2 of Listing 1). The resulting updated hypotheses are as follows.

[Diagram: the updated hypotheses H′<sup>1</sup> and H′<sup>2</sup>.]

Now the counter-example to the composite hypothesis H′<sup>1</sup> ∥ H′<sup>2</sup> is cacc. The projection on Σ<sup>2</sup> is ccc, which directly contradicts the counter-example received in the previous iteration. This spurious counter-example is thus repaired by backtracking in the second learner. The invocation of Equiv<sup>1</sup>(H′<sup>1</sup>) by the first learner does not return this counter-example, since H′<sup>1</sup> ∥ H′<sup>2</sup> and H′<sup>1</sup> do not agree on cacc, so the check on line 17 of Listing 1 fails.

Finally, in the next iteration, the respective hypotheses coincide with L<sup>1</sup> and L<sup>2</sup> and both learners terminate. ⊓⊔

#### 3.3 Correctness

As a first result, we show that our adapter provides correct information on each of the components when asking membership queries. This is required to ensure that information obtained by membership queries does not conflict with counter-examples. Proofs are omitted for space reasons.

Theorem 1. Answers from Member<sup>i</sup> are consistent with L(M<sup>i</sup>).

Before presenting the main theorem on correctness of our learning framework, we first introduce several auxiliary lemmas. In the following, we assume n instances of L<sup>∗</sup><sub>?,b</sub> run concurrently and each queries the corresponding functions Member<sup>i</sup> and Equiv<sup>i</sup>, as per our architecture (Figure 2). First, a counter-example cannot be spurious for all learners; thus at least one learner obtains valid information to progress its learning.

Lemma 1. Every counter-example obtained from Equiv(H) is valid for at least one learner.

The next lemma shows that even if a spurious counter-example occurs, this does not induce divergence, since it is always repaired by a corresponding positive counter-example in finite time.

Lemma 2. If Equiv(H) always returns a shortest counter-example, then each spurious counter-example is repaired by another counter-example within a finite number of invocations of Equiv(H), the monolithic teacher.

Our main theorem states that a composite system is learned by n copies of L<sup>∗</sup><sub>?,b</sub> that each call our adapter (see Figure 2).

Theorem 2. Running n instances of L<sup>∗</sup><sub>?,b</sub> terminates, and on termination we have H<sup>1</sup> ∥ · · · ∥ H<sup>n</sup> = M<sup>1</sup> ∥ · · · ∥ M<sup>n</sup>.

Remark 2. We cannot claim the stronger result that H<sup>i</sup> = M<sup>i</sup> for all i, since different component LTSs can result in the same parallel composition. For example, consider the LTSs below, both with alphabet {a}:

[Diagram: two LTSs H<sup>1</sup> and H<sup>2</sup> over alphabet {a}; H<sup>2</sup> has an a-transition while H<sup>1</sup> has none.]

Here we have H<sup>1</sup> ∥ H<sup>2</sup> = H<sup>1</sup> ∥ H<sup>1</sup>. The equivalence oracle may thus return yes even when the component LTSs differ slightly.

#### 3.4 Optimisations

There are a number of optimisations that can dramatically improve the practical performance of our learning framework. We briefly discuss them here.

First, finding whether there is a trace σ′ ∈ Π (line 2 of Listing 1) can quickly become expensive once the sets P<sup>i</sup> grow larger. We thus try to limit the size of each P<sup>i</sup> without impacting the amount of information it provides on the synchronisation opportunities offered by component M<sup>i</sup>. Therefore, when we derive that σ ∈ L(M<sup>i</sup>), we only store the shortest prefix ρ of σ such that ρ and σ contain the same synchronising actions. That is, σ = ρ · ρ′ and ρ′ contains only actions local to M<sup>i</sup>. Furthermore, we construct ∥<sub>j≠i</sub> P<sup>j</sup> only once after each call to Equiv<sup>i</sup>, and we cache accesses to ∥<sub>j≠i</sub> P<sup>j</sup> such that it is only traversed once when performing multiple queries σ<sup>1</sup>, σ<sup>2</sup> for which σ<sup>1</sup>↾Σother = σ<sup>2</sup>↾Σother holds. A possibility that we have not explored is applying partial-order reduction to eliminate redundant interleavings in ∥<sub>j≠i</sub> P<sup>j</sup>.
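The prefix-shortening step can be sketched as follows (a minimal sketch, with traces as tuples and `sync` as the set of synchronising actions):

```python
def sync_prefix(trace, sync):
    """Shortest prefix of a trace containing the same synchronising actions,
    i.e. the trace with its maximal purely-local suffix dropped."""
    k = len(trace)
    while k > 0 and trace[k - 1] not in sync:
        k -= 1
    return trace[:k]
```

Storing only these prefixes keeps each P<sup>i</sup> small while preserving exactly the synchronisation information the matching step in Listing 1 depends on.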

Since the language of an LTS is prefix-closed, we can – in some cases – extend the function T that is part of the observation table without performing membership queries. Concretely, if T(σ) = 0 then we can set T(σ · σ′) = 0 for any trace σ′. Dually, if T(σ · σ′) = 1 then we set T(σ) = 1.
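A sketch of this propagation rule (our own modelling, with T as a dict from traces to 0, 1, or None for unknown):

```python
def propagate(T, sigma, value):
    """Record T(sigma) = value and close T under prefix-closedness:
    a 0 forces 0 on every recorded extension of sigma, while a 1
    forces 1 on every prefix of sigma."""
    T[sigma] = value
    if value == 0:
        for tau in list(T):
            if tau[:len(sigma)] == sigma:   # tau extends sigma
                T[tau] = 0
    elif value == 1:
        for k in range(len(sigma)):
            T[sigma[:k]] = 1                # every proper prefix is accepted
    return T
```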

### 4 Experiments

We created an experimental implementation of our algorithms in a tool called Coal (COmpositional Automata Learner) [27], implemented in Java. It relies on LearnLib [22], a library for automata learning, which allows us to re-use standard data structures, such as observation tables, and to compare our framework to a state-of-the-art implementation of L<sup>∗</sup>. To extract a minimal LTS from an observation table, we first attempt the inexact blue-fringe variant of RPNI [20] (as implemented in LearnLib). If this does not result in a minimal LTS, we resort to an exact procedure based on a SAT translation; we use the Z3 solver [10].

Our experiments are run on a machine with an Intel Core i3 3.6GHz, with 16GB of RAM, running Ubuntu 20.04. For each experiment, we use a time-out of 30 minutes.

#### 4.1 Random Systems

We first experiment with a large number of composite systems where each of the component LTSs is randomly generated. This yields an accurate reflection of actual behavioural transition systems [16]. Each component LTS has a random number of states between 5 and 9 (inclusive, uniformly distributed) and a maximum number of outgoing edges per state between 2 and 4 (inclusive, uniformly distributed).

We assign alphabets to the component LTSs in five different ways that reflect real-world communication structures, see Figure 4. Here, each edge represents a communication channel that consists of two synchronising actions; each component LTS furthermore has two local actions. The hyperedge in multiparty indicates multiparty communication: the two synchronising actions in such a system are shared by all component LTSs. The graph that represents the bipartite communication structure is always complete, and the components are evenly distributed between both sides. Random is slightly different: it contains 2(n−1) edges, where n is the number of components, each consisting of one action; we furthermore ensure the random graph is connected.

Fig. 4: Communication structure of the randomly generated systems. Dots represent component LTSs; edges represent shared synchronising actions.

For our five communication structures, we create ten instances for each number of components between 4 and 9; this leads to a total benchmark set of 300 LTSs. Out of these, 47 have more than 10,000 states, including 12 LTSs of more than 100,000 states. The largest LTS contains 379,034 states. Bipartite often leads to relatively small LTSs, due to its high number of synchronising actions.

Fig. 5: Performance of L <sup>∗</sup> and compositional learning on random models.

On each LTS, we run the classic L<sup>∗</sup> algorithm and Coal, and record the number of queries posed to the teacher.<sup>5</sup> The result is plotted in Figure 5; note the log scale. Here, marks that lie on the dashed line indicate a time-out or out-of-memory for one of the two algorithms.

Coal outperforms the monolithic L<sup>∗</sup> algorithm in the number of membership queries for all cases (unless it fails). In more than half of the cases, the difference is at least three orders of magnitude; it can even reach six orders of magnitude. For equivalence queries, the difference is less pronounced, but our compositional approach scales better for larger systems. This is especially relevant because, in practical implementations, equivalence queries may require a number of membership queries that is exponential in the size of the system. Multiparty communication systems benefit most from compositional learning. The number of spurious counter-examples that occurs for these models is limited: about one on average. Only twelve models require more than five spurious counter-examples; the maximum number required is thirteen. This is encouraging, since even for this varied set of LTSs the amount of duplicate work performed by Coal is limited.

<sup>5</sup> The number of queries is the standard performance measure for query learning algorithms; runtime is less reliable, as it depends on the specific teacher implementation.

Table 1: Performance of Coal and L<sup>∗</sup> for realistic composite systems.

#### 4.2 Realistic Systems

Next, we investigate the performance of Coal on two realistic systems that were originally modelled as a Petri net. These Petri nets can be scaled according to some parameters to yield various instances. The ProdCons system models a buffer of size K that is accessed by P producers and C consumers; it is described in [32, Fig. 8]. The CloudOpsManagement net is obtained from the 2019 Model Checking Contest [2], and describes the operation of C containers and operating systems and W application runtimes in a cloud environment. Furthermore, we scale the number N of application runtime components. We generate the LTS that represents the marking graph of these nets and run L<sup>∗</sup> and Coal; the results are listed in Table 1. For each system, we list the values of scaling parameters, the number of components and the number of states of the LTS. For Coal and L<sup>∗</sup>, we list the runtime and the number of membership and equivalence queries; for Coal we also list the number of spurious counter-examples (column spCE).

The results are comparable to our random experiments: Coal outperforms L<sup>∗</sup> in the number of queries, especially for larger systems. For the two larger CloudOpsManagement instances, the increasing runtime of Coal is due to the fact that two of the components grow as the parameter W increases. The larger number of states causes a higher runtime of the SAT procedure for constructing a minimal LTS.

We remark that in our experiments, the teacher has direct access to the LTS we aim to learn, leading to cheap membership and equivalence queries. Thus, in this idealised setting, L<sup>∗</sup> incurs barely any runtime penalty for the large number of queries it requires. Using a realistic teacher implementation would quickly cause time-outs for L<sup>∗</sup>, making the results of our experiments less insightful.

### 5 Related Work

Finding ways of projecting a known concurrent system down into its components is the subject of several works, e.g., [8,17]. In principle, it would be possible to learn the system monolithically and use the aforementioned results. However, as shown in Section 4, this may result in a substantial query blow-up.

Learning approaches targeting various concurrent systems exist in the literature. As an example of the monolithic approach above, the approach of [6] learns asynchronously-communicating finite state machines via queries in the form of message sequence charts. The result is a monolithic DFA that is later broken down into components via an additional synthesis procedure. This approach thus does not avoid the exponential blow-up in queries. Another difference with our work is that we consider synchronous communication.

Another monolithic approach is [18], which provides an extension of L<sup>∗</sup> to pomset automata. These automata are acceptors of partially-ordered multisets, which model concurrent computations. Accordingly, this relies on an oracle capable of processing pomset-shaped queries; adapting the approach to an ordinary sequential oracle – as in our setting – may cause a query blow-up.

A severely restricted variant of our setting is considered in [13], which introduces an approach to learn Systems of Procedural Automata. Here, DFAs representing procedures are learned independently. The constrained interaction of such DFAs allows for deterministically translating between component-level and system-level queries, and for univocally determining the target of a counter-example. Our setting is more general – arbitrary (not just pair-wise) synchronisations are allowed at any time – hence these abilities are lost.

Two works that do not allow synchronisation at all are [23,25]. In [23], individual components are learned without any knowledge of the number of components and their individual alphabets; however, components cannot synchronise (alphabets are assumed to be disjoint). This is a crucial difference with our approach, which instead has to deal with unknown query results and spurious counter-examples precisely due to the presence of synchronising actions. An algorithm for learning Moore machines with decomposable outputs is proposed in [25]. This algorithm spawns several copies of L<sup>∗</sup>, one per component. This approach is not applicable to our setting, as we do not assume decomposable outputs and allow dependencies between components.

Other approaches consider teachers that are unable to reply to membership queries [1,14,15,24]; they all use SAT-based techniques to construct automata. The works closest to ours are [24], which considers the problem of compositionally learning a property of a concurrent system with full knowledge of the components, and [1], which learns an unknown component of the serial composition of two automata. In none of these works do spurious counter-examples arise.

### 6 Conclusion

We have shown how to learn component systems with synchronous communication in a compositional way. Our framework uses an adapter and a number of concurrent learners. Several extensions to L<sup>∗</sup> were necessary to circumvent the fundamental limitations of the adapter. Experiments with our tool Coal show that our compositional approach offers much better scalability than a standard monolithic approach.

In future work, we aim to build on our framework in a couple of ways. First, we want to apply these ideas to extensions of L<sup>∗</sup> such as TTT [21] (which reduces the number of queries) and to algorithms for learning extended finite state machines [7]. Our expectation is that the underlying learning algorithm can be replaced with little effort. Next, we want to eliminate the assumption that the alphabets of individual components are known a priori. We envisage this can be achieved by combining our work with [23].

We would also like to explore the integration of learning and model checking. A promising direction is learning-based assume-guarantee reasoning, originally introduced by Cobleigh et al. [9]. This approach assumes that models for the individual components are available. Using our approach, we may be able to drop this assumption and enable a fully black-box compositional verification approach.

Acknowledgements. We thank the anonymous reviewers for their useful comments, and Tobias Kappé for suggesting several improvements. This research was partially supported by the EPSRC Standard Grant CLeVer (EP/S028641/1).

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Concolic Testing of Front-end JavaScript

Zhe Li() and Fei Xie

Portland State University, Portland, OR 97201, USA {zl3,xie}@pdx.edu

Abstract. JavaScript has become the most popular programming language for web front-end development. With such popularity, there is a great demand for thorough testing of client-side JavaScript web applications. In this paper, we present a novel approach to concolic testing of front-end JavaScript web applications. This approach leverages widely used JavaScript testing frameworks such as Jest and Puppeteer and conducts concolic execution on JavaScript functions in web applications for unit testing. The seamless integration of concolic testing with these testing frameworks allows injection of symbolic variables within the native execution context of a JavaScript web function and precise capture of concrete execution traces of the function under test. Such concise execution traces greatly improve the effectiveness and efficiency of the subsequent symbolic analysis for test generation. We have implemented our approach on Jest and Puppeteer. The application of our Jest implementation on Metamask, one of the most popular Crypto wallets, has uncovered 3 bugs and 1 test suite improvement, whose bug reports have all been accepted by Metamask developers on Github. We also applied our Puppeteer implementation to 21 Github projects and detected 4 bugs.

Keywords: Concolic Testing · JavaScript · Front-end Web Application.

### 1 Introduction

JavaScript (JS), as the most popular web front-end programming language, is used by 95.1% of websites [23]. Many such websites handle sensitive information such as financial transactions and private conversations. Errors in these websites not only affect user experience, but also endanger the safety, security, and privacy of users. Therefore, these websites, particularly their dynamic functions that are often implemented in JS, must be thoroughly tested to detect software bugs. There have been many testing frameworks for JS applications, such as Jest and Puppeteer. These frameworks provide a systematic way to test JS applications and reduce the tedious testing setup, particularly for unit testing. However, although these testing frameworks simplify the execution of testing, they do not provide test data for web applications. Such test data still needs to be provided manually by application developers, which is often very time-consuming and laborious. Moreover, achieving high code and functional coverage on web applications with high-quality test data remains a challenge [34].

Symbolic execution has shown great promise in software testing, particularly in test data generation [29]. It exercises software with symbolic inputs, explores its execution paths systematically, and generates test data for the paths explored symbolically. However, symbolic execution may suffer from path explosion when the software has too many paths to explore [26]. Concolic testing addresses path explosion by combining concrete execution with symbolic execution. The software is first exercised with a concrete input, and the resulting concrete execution trace is then analyzed symbolically to explore paths that are adjacent to the concrete trace. Concolic testing has achieved many successes in software testing [25]. It is strongly desirable to apply concolic testing to front-end JS web applications to generate high-quality test data automatically, so that manual efforts can be reduced and test coverage can be improved. However, front-end JS applications pose major challenges to concolic testing. These applications typically execute in the context of web browsers, which tends to be complex, and they are usually event-driven, user-interactive, and string-intensive [35].
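The concolic loop described above can be illustrated with a toy driver (our own illustration, not the approach of this paper or of any real engine): inputs are single integers, branch conditions compare against constants, and a brute-force search stands in for a constraint solver.

```python
class Sym:
    """Toy symbolic integer: evaluates concretely, while recording every
    branch condition taken as an (operator, constant, outcome) triple."""
    def __init__(self, value, trace):
        self.v, self.trace = value, trace
    def __gt__(self, c):
        r = self.v > c
        self.trace.append(('>', c, r))
        return r
    def __eq__(self, c):
        r = self.v == c
        self.trace.append(('==', c, r))
        return r

def solve(conds):
    """Stand-in 'solver': brute-force an integer satisfying all conditions."""
    for x in range(-100, 101):
        if all((x > c) == r if op == '>' else (x == c) == r
               for op, c, r in conds):
            return x
    return None

def concolic(f, seed, max_runs=10):
    """Concolic loop: run f concretely, then negate each branch condition
    along the recorded path and solve for inputs that cover new paths."""
    inputs, seen_paths, worklist = [], set(), [seed]
    while worklist and len(inputs) < max_runs:
        x = worklist.pop(0)
        trace = []
        f(Sym(x, trace))                  # concrete run with trace capture
        path = tuple(trace)
        if path in seen_paths:
            continue                      # path already explored
        seen_paths.add(path)
        inputs.append(x)
        for i, (op, c, r) in enumerate(trace):
            alt = solve(path[:i] + ((op, c, not r),))
            if alt is not None:
                worklist.append(alt)      # input steering into an adjacent path
    return inputs
```

Seeded with 0 on a function that first checks x > 10 and then x == 42, the driver generates one input per feasible path (0, 11 and 42), without enumerating the input space.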

In this paper, we present a novel approach to concolic testing of front-end JS web applications. This approach leverages widely used JS testing frameworks such as Jest and Puppeteer and conducts concolic execution on JS web functions for unit testing. These testing frameworks isolate the web function under test from the context of its embedding web page by mocking the environment, and provide the test data that drives the function. This isolation of the web function provides an ideal target for the application of concolic testing. We integrate concolic testing APIs into these testing frameworks. The seamless integration of concolic testing allows injection of symbolic variables within the native execution context of a JS web function and precise capture of concrete execution traces of this function. As the testing framework executes the function under test with test data, parts or all of the test data can be made symbolic, and the resulting execution traces of the function are captured for later symbolic analysis. Concise execution traces greatly improve the effectiveness and efficiency of the subsequent symbolic analysis for test generation. The new test data generated by the symbolic analysis is again fed back to the testing frameworks to drive further concolic testing.

We have implemented our approach on Jest and Puppeteer. The application of our Jest implementation to Metamask, one of the most popular Crypto wallets, has uncovered 3 bugs and 1 test suite improvement, whose bug reports have been accepted by Metamask developers on Github. We have also applied our Puppeteer implementation to 21 Github projects and detected 4 bugs.

### 2 Background

### 2.1 Front-end JavaScript Testing Frameworks

In a general software testing framework, a test case is designed to exercise a single, logical unit of behavior in an application and ensure the targeted unit operates as expected [21]. Typically, it is structured as a tuple {P, C, Q}:
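Since the tuple's components are elided in this excerpt, the structure can be sketched as follows; the reading P = preconditions/setup, C = the call that exercises the unit, and Q = the check against the expected outcome is our assumption, and all names below are illustrative:

```javascript
// A runnable sketch of the {P, C, Q} test-case structure (component names
// are our assumption, not the cited framework's).
function runTestCase({ setup, call, check }) {
  const env = setup();        // P: establish preconditions (fixtures, mocks)
  const result = call(env);   // C: exercise the unit under test
  return check(result, env);  // Q: decide pass/fail against the oracle
}

// Example: a trivial string-normalizing unit.
const passed = runTestCase({
  setup: () => ({ input: '  Hello ' }),
  call: (env) => env.input.trim().toLowerCase(),
  check: (result) => result === 'hello',
});
```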


As shown in Figure 1, a front-end JS testing framework inspects the web application in the browser for JS functions to test. It utilizes testing libraries to obtain the web pages, parses them, and stores page functions and their context information individually so that test runners can run the functions browserless [4]. The test runner sets up the three parts of a test case for each JS function under test and then executes the test case. The front-end JS testing framework helps isolate the JS function under test and provides the execution context for testing it, which makes it an ideal entry point for applying concolic testing to front-end JS.

Fig. 1: Front-end JS testing framework workflow

#### 2.2 In-situ Concolic Testing of Backend JavaScript

In [9], a new approach has been introduced to apply concolic testing to backend JS in-situ, i.e., scripts are executed in their native environments (e.g., Node.js) as part of concolic execution, and generated test cases are directly replayed in these environments [13]. As illustrated in Figure 2, the concrete execution step of concolic testing, indicated by the dashed box on top, is conducted in the native execution environment for JS, where the trace of this concrete execution is captured. The trace is then analyzed in the symbolic execution step of concolic testing to generate test cases, which are fed back into the native concrete execution to drive further test case generation. This approach has been implemented on the Node.js execution environment and its V8 JS engine [24]. As a script is executed with Node.js, its binary-level execution trace is captured and later analyzed through symbolic execution for test case generation. The approach also offers the flexibility of customizing the trace as needed; we leverage this functionality in our approach.

Fig. 2: Workflow for in-situ concolic testing of backend JavaScript

# 3 Approach

### 3.1 Overview

Our approach strives to apply concolic testing to front-end JS web applications to generate effective test data for unit testing these applications. Below are the specific design goals of our approach:


With the above goals in mind, we design an approach to concolic testing of front-end JS web applications, which leverages JS testing frameworks such as Jest and Puppeteer and conducts concolic execution on JS web functions for unit testing. The seamless integration of concolic testing with these testing frameworks is achieved by extending in-situ concolic testing of backend JS applications. Figure 3 illustrates how the integration is realized:


(c) Workflow for enabling effective in-situ concolic testing on front-end JS

Fig. 3: Overview for concolic testing of front-end JS

3. Workflow 3 in Figure 3c illustrates how we leverage a JS testing framework to extract the front-end JS web function and its execution context from the web page. During the extraction, we encapsulate them as a pure JS function augmented with the web page information, inject symbolic values, and capture execution traces for later symbolic analysis by calling the symbolic execution interface functions within the extracted execution context. We then utilize the test runner of the JS testing framework to initiate and drive concolic testing within the execution context to generate new test data.

This workflow allows faithful simulation of the execution context of a JS web function without the presence of a web browser. It enables injection of symbolic variables and capture of concrete execution traces within the execution context of the JS web function under test. A concise and accurate concrete execution trace can greatly improve the effectiveness and efficiency of the subsequent symbolic analysis for test generation. We explain how to decide the starting point of tracing within the native execution context, and what difference it makes, in Section 3.2.

### 3.2 Concolic Testing of JS Web Function within Execution Context

A front-end JS web function is invoked from a web page, and its execution depends on the execution context from the web page [28]. The core of our approach is to enable concolic testing of the JS web function within its native execution context from the web page, in the same manner as in-situ concolic execution of back-end JS. We achieve this in three steps: execution context extraction, execution context tracing customization (including symbolic value injection and tracing control), and concolic execution within the execution context.

Fig. 4: Concolic testing of JS Web function within execution context

Execution Context Extraction To transform a JS web function into a pure JS function without losing its web page context, we introduce a function interceptor into the JS testing framework. As shown in Figure 4, the function interceptor completes the following tasks to finish this transformation for later in-situ concolic testing in the back-end:


JS function's native web environment, it extracts the associated execution context of the web page. This is realized by calling helper functions provided by the testing libraries of the JS testing framework. The execution context contains everything that is needed for the pure JS function to be executed in the web page, which includes the arguments of the function, its concrete dependency objects set by mocking data and the function scope.

– Third, the function interceptor delivers a complete function in pure JS form, encapsulated with its associated web execution context, by assembling them; it then makes the function accessible to the test runner of the JS testing framework, so that the test runner can initiate concolic execution in the execution context when running the test suite.

Execution Context Tracing Customization In-situ concolic testing offers the capability of tracing inside the V8 JS engine to capture an execution trace that closely matches the JS bytecode interpretation [9,22]. The conciseness of an execution trace determines the efficiency and effectiveness of the later symbolic analysis and test case generation. Therefore, to make the most of this capability, we pinpoint where to introduce symbolic values and start tracing during the extraction of the execution context, before we commence concolic testing of the encapsulated JS web function with its execution context. In-situ concolic testing provides interface functions for introducing symbolic values (MarkSymbolic()) and controlling tracing (StartTracing()). We use these interface functions to customize execution context tracing as needed.
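The shape of this customization can be sketched as follows. The stubs below only record calls so the sketch runs standalone; the real MarkSymbolic()/StartTracing() live inside the modified Node.js/V8 runtime, and their exact signatures are our assumption:

```javascript
// Stand-ins for the in-situ concolic engine's interface functions.
const log = [];
function MarkSymbolic(value) { log.push('mark-symbolic'); return value; }
function StartTracing() { log.push('start-tracing'); }

function runEncapsulated(fnUnderTest, concreteInput) {
  // ...test-runner setup runs here and is deliberately left untraced...
  const input = MarkSymbolic(concreteInput); // inject the symbolic value
  StartTracing();                            // trace only from this point on
  return fnUnderTest(input);
}

const out = runEncapsulated((s) => (s === 'foo' ? 'match' : 'not match'), 'bar');
```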

Symbolic Value Injection and Tracing Control A JS testing framework uses a test runner to execute its test suites. As shown in Figure 5, the test runner prepares the dependencies for setting up the testing environment and loads the JS libraries the test suites need before it starts running the individual function under test. In order to avoid tracing the unnecessary startup overhead of the test runner (indicated by the red box in Figure 5), we inject symbolic values inside the execution context and start tracing when the test runner actually executes the encapsulated function, by calling the interface functions that in-situ concolic testing provides. This way the execution tracer only captures the execution trace of the encapsulated JS web function. The locations for injecting symbolic values and starting tracing are indicated in the "Execution Context (EC)" box of Figure 4, and the captured execution trace is indicated by the "Execution Trace" box in the right corner of Figure 4.

Fig. 5: How to avoid unnecessary tracing of the test runner setup by delaying injection of symbolic values and start of tracing

Most Concise Execution Trace Figure 6 shows why our approach obtains the most concise execution trace for the JS web function driven by the test runner of the JS testing framework. Apart from the overhead caused by the test runner, the extraction of the execution context for the JS web function involves calling a set of JS helper functions to collect web page information, such as helper js 1 and JSHandle js 1. If we directly apply symbolic execution within the test runner where the JS function is intercepted along with the execution context extraction, the execution tracer will also capture the execution traces of the test runner and the testing helper functions from the testing libraries, shown as "Execution Trace 0" on the right-hand side of Figure 6. We modified the test runner to mark symbolic variables and enable tracing control within the execution context. Instead of starting tracing when the test runner starts, we defer tracing to when and where the test runner actually executes the encapsulated function under test in the extracted execution context, indicated by "Execution Trace 1" on the left-hand side of Figure 6. This way we minimize the extent of execution tracing needed.

Fig. 6: How we obtain the most concise concrete execution trace

Concolic Testing within Execution Context We leverage the test runner of the JS testing framework to initiate and start the in-situ concolic testing of the JS web function under test. Typically the test runner starts running the JS web function with an existing unit test. In our approach, the execution of the unit test triggers the function interceptor, which starts the process of extracting the execution context and encapsulating the target JS web function. During this process, symbolic values are injected and tracing is started in the right place as described in previous sections. The resulting pure JS application is then executed by in-situ concolic testing. Newly generated test data is fed back to the JS testing framework to drive further concolic testing.

# 4 Implementations

In this section, we demonstrate the feasibility of our approach to concolic testing of front-end JS functions by implementing it on two popular JS testing frameworks, namely Puppeteer and Jest assisted by the React testing library [18,14].

### 4.1 Implementation on Puppeteer

Puppeteer is a testing framework developed by the Chrome team and implemented as a Node.js library [14]. It provides a high-level API to interact with headless (or full) Chrome and can simulate browser functions using testing libraries. Puppeteer can execute JS functions residing in a web page without a browser, and it allows us to easily navigate pages and fetch information about them. In our implementation on Puppeteer, we augment it with the function interceptor to identify the targeted web JS functions, extract their execution contexts from the web pages, and encapsulate them for in-situ concolic testing.

Encapsulating JS Web Function with Execution Context As shown in Figure 7, Puppeteer communicates with the browser [15]. One browser instance can own multiple browser contexts. A Browser Context instance defines a browsing session and can have more than one page; it provides a way to operate an independent browser session [3]. A Page has at least one frame, and each frame has a default execution context, where the frame's JavaScript is executed. This context is returned by the frame.executionContext() method, which gives details about a page frame. We implement the function interceptor in the Execution Context class under the browser context to collect the information necessary for encapsulating a JS function with its associated web execution context. The Execution Context class represents a context for JS execution in the web page. We modified it to identify the page function, its arguments, and its return value [5]. The pageFunction is the function in the HTML page to be evaluated in the execution context, which is in a pure JS form. For example, Listing 1.1 shows a front-end application example written with the Express web development framework [6]. This example contains a web page (from line 7 to line 17) with a JS web function marked by the <script> tag in line 15. The ${path} points to the JS file that contains the implementation of the JS web function, as shown in Listing 1.2. Our approach is able to encapsulate the pure JS form of the web JS function (its implementation) with its associated web execution context.

Fig. 7: How Puppeteer executes a JS function in a web page

Listing 1.1: An example of a front-end web application using Express framework

```
 1 const app = express()
 2   .use(middleware(compiler, { serverSideRender: true }))
 3   .use((req, res) => {
 4     const webpackJson = res.locals.webpack.devMiddleware.stats.toJson()
 5     const paths = getAllJsPaths(webpackJson)
 6     res.send(
 7       `<!DOCTYPE html>
 8       <html>
 9       <head>
10         <title>Test</title>
11       </head>
12       <body>
13         <div id="root"></div>
14         ${paths.map((path) =>
15           `<script src="${path}"></script>`).join('')}
16       </body>
17       </html>`
18     )
19   })
```
Listing 1.2: An example of a front-end JS script under Express framework

```
function foo(args) {
  if (args === 'foo') {
    return 'match';
  }
  return 'not match';
}
module.exports = foo;
```
Execution Context Tracing Customization We utilize the page.evaluate function of the Puppeteer testing framework to drive the JS function under test and extend it with the function interceptor. As described in Figure 8, to enable customized execution context tracing, the function interceptor introduces symbolic variables and sets the starting point for tracing within the web execution context of the JS function wrapped by the <script> tag in the web page. This way, the test runner can initiate concolic testing when it starts running the test suites, so that the JS function can be tested concolically and automatically without tracing additional overhead. Since the Execution Context is triggered by the evaluate function in unit tests, we target applications from GitHub that use Puppeteer to test front-end features and utilize evaluate in unit testing. We discuss the results in Section 5.

Fig. 8: How we set symbolic variables in the execution context and enable customized execution context tracing in Puppeteer
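For illustration, a unit test might drive the foo() function of Listing 1.2 through evaluate as sketched below. The page object here is a synchronous stand-in so the sketch runs without a browser; Puppeteer's real page.evaluate is asynchronous and runs the pageFunction inside the page's execution context, which is exactly where our function interceptor hooks in:

```javascript
// Function under test (from Listing 1.2).
const foo = (args) => (args === 'foo' ? 'match' : 'not match');

// Synchronous stand-in for a Puppeteer Page (assumption for this sketch).
const page = {
  evaluate: (pageFunction, ...args) => pageFunction(...args),
};

// The interceptor wraps this call site: it extracts the execution context,
// marks the argument symbolic, and starts tracing before foo runs.
const result = page.evaluate(foo, 'foo');
```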

### 4.2 Implementation on Jest with React Testing Library

Another implementation of our approach is on the Jest testing framework assisted by the React testing library for unit testing. The React testing library is a lightweight library for testing React components, which wrap JS functions with HTML elements [18]. As shown in Figure 9, there are three components in the application, as indicated by the numbers. Components allow splitting a UI into independent, reusable pieces and designing each piece in isolation. React is flexible; however, it has a strict rule: all React components must act as pure functions with respect to their inputs [16]. We refer to them as "functional components". They accept arbitrary inputs (called "props") and return React elements describing what should appear on the web page [17]. An individual component can be reused in different combinations; therefore, the correctness of an individual component matters for the correctness of its compositions. In our implementation, we only consider components that have at least one input.

Fig. 9: Example React Components
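A "functional component" in this sense is simply a pure function of its props. The sketch below returns a plain element descriptor instead of JSX so it runs without React (React.createElement produces a similar object); the component and its props are illustrative:

```javascript
// A pure functional component: output depends only on props.
function Greeting({ name }) {
  return { type: 'h1', props: { children: `Hello, ${name}` } };
}

// Purity: equal props always yield equal output, which is what makes each
// component an isolatable unit for per-component concolic testing.
const a = Greeting({ name: 'Ada' });
const b = Greeting({ name: 'Ada' });
```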

Jest has a test runner, which allows us to run tests from the command line. Jest also provides additional utilities such as mocks and stubs, besides the usual test cases, assertions, and test suites. We use Jest's mock data to set up the testing environment for the front-end components defined with React. Figure 10 shows how we leverage and extend Jest, assisted by the React testing library, to apply in-situ concolic testing to React components. To encapsulate the JS function in the component with its execution context, we augmented the render function, whose functionality is to render the React component function and props as an individual unit for Jest to execute from the web page, with the function interceptor. Through the render function, the function interceptor extracts a complete execution context for the functional component and intercepts the JS function wrapped in the functional component, as indicated by the arrows in Figure 10. To enable customized execution context tracing, the function interceptor then marks symbolic variables and starts tracing after the completion of the encapsulation. Finally, we configure Jest's test runner to run each unit test individually while initiating in-situ concolic execution, so that we obtain the most concise execution traces for later symbolic analysis.

Fig. 10: How to apply in-situ concolic testing on React components using Jest
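The render() augmentation can be sketched as follows. The wrapper and hooks below are stand-ins so the sketch runs standalone; the real render comes from the React testing library, and the real hooks call into the in-situ concolic engine:

```javascript
// Record of the interceptor's hook calls (stand-in for the real engine).
const hooks = [];

function interceptedRender(render, Component, props) {
  const context = { props };     // extracted execution context (simplified)
  hooks.push('mark-symbolic');   // stand-in: mark the props symbolic
  hooks.push('start-tracing');   // stand-in: start execution tracing here
  return render(Component, context.props);
}

// Plain render stand-in: just invoke the functional component.
const plainRender = (Component, props) => Component(props);

const rendered = interceptedRender(
  plainRender,
  ({ label }) => `button:${label}`,
  { label: 'Send' },
);
```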

### 5 Evaluations

For evaluations, we apply our in-situ concolic testing approach to front-end JS web application projects that come with unit test suites and utilize Jest with the React testing library or Puppeteer. In these evaluations, we target the String and Number types as symbolic variables for the functions under test.

### 5.1 Evaluation of Puppeteer Implementation on Github Projects

We have selected 21 GitHub projects utilizing Puppeteer and test them using the Puppeteer framework extended with our concolic testing capability. As a result, we discovered 4 bugs triggered from their web pages, 2 of which originate from their dependency libraries.

Evaluation Setup We selected GitHub projects with the following properties as our targets:


We have developed a script based on these properties and used the search API provided by GitHub to collect applicable projects [20]; 21 projects were collected. Table 1 summarizes the demographics of the 21 GitHub projects collected by our script. We calculated the statistics using ls-files [7] combined with cloc [8]. LoC/JS is the LoC (lines of code) of all JS files, which includes the JS files of the libraries the project depends on. LoC/HTML is the LoC of HTML files, which indicates the volume of the project's front-end web contents. The LoC of unit tests (LoC/unit test) includes the unit test files ending with .test.js. The test ratio is the ratio of LoC/unit test to LoC/JS, indicating the availability of unit tests for the project. Before evaluation, we configure these projects to use the extended Puppeteer framework instead of the original one.
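The collection script can be sketched as follows; the query terms and the injectable fetch implementation are our assumptions, not the paper's actual script:

```javascript
// Hypothetical sketch: query GitHub's repository search API for candidate
// projects. fetchImpl is injectable so the sketch can be exercised offline.
async function searchRepos(query, fetchImpl) {
  const url =
    'https://api.github.com/search/repositories?q=' + encodeURIComponent(query);
  const res = await fetchImpl(url, {
    headers: { Accept: 'application/vnd.github+json' },
  });
  const body = await res.json();
  return (body.items || []).map((repo) => repo.full_name);
}
```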

Table 1: Selected Projects that utilize Puppeteer for unit testing

Result Analysis We ran each project with our approach for 30 minutes. On average, our implementation generates 200 to 400 test cases for each function. Table 2 summarizes the bugs detected. For polymer, our method generates two types of test cases that trigger two different bugs in the user password validation functionality of the project: 1) a generated test case induces execution to skip an if branch, which causes the password to be undefined, leading the condition !password || this.password === password to return true when it should have returned false; we fixed this bug by changing the operator || to &&. 2) test cases containing Unicode characters fail password pattern matching using a regular expression without the g flag, i.e., /[!@#\$%^&\*(),.?":|<>]/.test(value). For InsugarTrading, a test case of a string not containing a comma is generated for the str.split(',') function; the return value of an empty array causes errors in the dependency library cookie-connoisseur. A number out-of-bound error is discovered in the changeCell() function of TicTacToe. For phantomas, the function phantomas checks that url is of string type but does not pattern-match it; a generated test case with an invalid url causes an exception in function addScriptToEvaluateOnNewDocument of chromeDevTools.
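The polymer short-circuit bug can be reconstructed in miniature as below; the surrounding validation logic is assumed from the report, and only the reported operator change is shown:

```javascript
// Buggy form: when a skipped branch leaves password undefined,
// !password short-circuits the || and the condition returns true.
function checkBuggy(stored, password) {
  return !password || stored === password;
}

// The accepted fix replaces || with &&, so an undefined password
// no longer makes the whole condition true.
function checkFixed(stored, password) {
  return !password && stored === password;
}

const buggy = checkBuggy('secret', undefined); // true  (bug)
const fixed = checkFixed('secret', undefined); // false (expected)
```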

Table 2: Bugs detected in web applications using Puppeteer from Github


We identified two traits of the projects for which we did not detect bugs. (1) The project does not fit the design of our Puppeteer implementation, i.e., evaluate is not used in the test suite. (2) The applicable JS part is small and well tested.

### 5.2 Evaluation of Jest Implementation on Metamask

In the evaluation of the implementation of our concolic testing approach on Jest, we focus on Metamask's browser extension for Chrome. MetaMask is a software crypto-currency wallet used to interact with the Ethereum blockchain. It allows users to access their Ethereum wallet through a browser extension or mobile app, which can then be used to interact with decentralized applications [12]. The Metamask extension utilizes the render functionality for testing JS functions in React components. We focus on front-end JS web functions, React component functions in particular; they reside in the ui folder of the metamask-extension project.

Testing Coverage Statistics of Metamask We select the ui folder as our evaluation target for two reasons: (1) React components of metamask-extension are mostly defined and implemented under this folder; (2) the functions in this folder are under-tested. Figure 11 shows the current testing coverage statistics of the ui folder of metamask-extension [1]. We can see that only one sub-folder of ui (which also happens to be named ui) has a relatively high coverage of 82.03%. Most other folders have coverage under 70%, or even lower.

Fig. 11: Coverage statistics of ui folder of Metamask-extension

Evaluation Setup In the unit testing workflow of metamask-extension, there is a global configuration for all unit test suites of UI components. This is because one component's functionality may depend on other components; therefore, metamask-extension needs to be executed as an instance to support unit testing. To evaluate the implementation of in-situ concolic testing for React components, we need an independent environment for each component function, wrapped with a single test file. This test file contains only one function under test, so each test file is an independent in-situ concolic testing runner for a function in a component. We implement an evaluation setup script to complete this task. This script automatically prepares the evaluation environment for in-situ concolic testing of a React component. Specifically, it does the following work under the folder where the target component resides:


– Dependency Installation. Collect and install dependencies for the target component. Such dependencies can be components or libraries.

Result Analysis After we set up the evaluation environment, we conduct our evaluation in a sandbox on the test network of Metamask. We have uncovered 3 bugs and 1 test suite improvement, as shown in Table 3. We have filed them as bug reports through GitHub, and they have been accepted by Metamask developers. Along the way, we also found some similar test cases that Metamask's bot reported.


Table 3: Bugs Detected in Metamask under UI folder

For the buy-eth feature, as shown in Figure 12, a test network error with a response code of 500 was triggered when testing the Ether deposit functionality. Concolic testing generates a test case with an invalid chainId for buyEth(), which is defined in the DepositEtherModal component, is wrapped by a <Button> tag, and can be triggered by onClick(). buyEth() calls into buyEthUrl(), which retrieves a url for the buyEth() function. buyEthUrl() does not check whether the url is valid or null before it calls openTab(url) with the returned url, and there is also no validation of the input in the component implementation. Additionally, this process was not wrapped in a try/catch block. We caught this error in our evaluation. We tested 16 component folders and discovered that metamask-extension will most likely skip input checking if inputs are not directly from users; chainId is retrieved from mock data in this case, which is generated by our concolic engine.

Fig. 12: Error trace of the bug discovered in buy-eth
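The failure path can be reconstructed as below; the function names follow the report, while the bodies, the url mapping, and the guard are our assumptions:

```javascript
// Assumed mapping: buyEthUrl() yields undefined for an unknown chainId.
function buyEthUrl(chainId) {
  const urls = { '0x1': 'https://deposit.example/eth' }; // hypothetical
  return urls[chainId];
}

// Reported bug shape: the url is used without a guard or try/catch.
function buyEth(chainId, openTab) {
  return openTab(buyEthUrl(chainId));
}

// A guarded variant that rejects an invalid chainId instead of crashing.
function buyEthGuarded(chainId, openTab) {
  const url = buyEthUrl(chainId);
  if (!url) return null;
  try {
    return openTab(url);
  } catch {
    return null;
  }
}
```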

For the token-search feature, we uncovered a bug triggered by an empty string. In the TokenSearch component, function handleSearch() is wrapped by <TextField> with an onChange method. It calls isEqualCaseInsensitive() with an empty string as its second argument without boundary checking. Function isEqualCaseInsensitive is defined in utils.js, which provides shared functions. We found that the unit tests for utils.js do not have a test suite for that function, while the same bug was not found in the experiment conducted on the send.js file. In send.js, function validateRecipientUserInput also calls the faulty isEqualCaseInsensitive; however, since send.js checks for both empty-string and null inputs before calling it, it avoids the potential error present in utils.js.

For the ens-input feature, in the onChange method of the EnsInput component's <input/>, the function isValidDomain is called. Our approach generated test cases with unacceptable characters in the domain name, e.g., %ff.bar. When we replay this test case, function isValidDomain returns true when it should return false. In Listing 1.3, function isValidDomain returns the value of the condition match !== undefined. This test case fails the regex matching, which returns null, but null is not equal to undefined in JS, so the condition still evaluates to true.

Listing 1.3: A code segment of utils.js with function isValidDomain showing incorrect behavior in line 8

```
1 function isValidDomainName(address) {
2   var match = punycode
3     .toASCII(address)
4     .match(
5       /^(?:[a-z0-9](?:[-a-z0-9]*[a-z0-9])?\.)+[a-z0-9][-a-z0-9]*[a-z0-9]$/u,
6     );
7   // After the match call, match === null;
    // therefore match !== undefined returns true.
8   return match !== undefined;
9 }
```
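The root cause in miniature: String.prototype.match returns null (not undefined) on a failed match, so the check in line 8 is always true. The sketch below omits the punycode step of Listing 1.3 but uses the same pattern:

```javascript
// '%ff.bar' fails the anchored domain pattern, so match is null.
const match = '%ff.bar'.match(
  /^(?:[a-z0-9](?:[-a-z0-9]*[a-z0-9])?\.)+[a-z0-9][-a-z0-9]*[a-z0-9]$/u,
);

const buggy = match !== undefined; // true even though match is null
const fixed = match !== null;      // false: '%ff.bar' is correctly rejected
```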
For the advanced-gas-fee feature, we found that the updateGasLimit(gasLimit) function (expecting a numeric input) in the <FormField> component misbehaves when given a string input containing only digits, such as "908832": it simply sets the gas limit to 0 without emitting an error. We do not consider this a bug, since the <FormField> component restricts the input to be numeric in the HTML element. After we filed it, it was marked with the area-testSuite tag on GitHub by developers as a test suite improvement.

### 6 Related Work

Our approach is closely related to work on symbolic execution for JS. Most of it aims at back-end/standalone JS programs, primarily targets specific bug patterns, and depends on whole-program analysis. Jalangi works on pure JS programs and instruments the source JS code to collect path constraints and data for replaying [38]. COSETTE is another symbolic execution engine for JS, using an intermediate representation, namely JSIL, translated from JS [36]. ExpoSE applies symbolic execution to standalone JS and uses Jalangi as its symbolic execution engine; its contribution is in addressing a limitation of Jalangi, namely support for regular expressions in JS [33]. There are few symbolic analysis frameworks for JS web applications. Oblique injects a symbolic JS library into a page's HTML. When a user loads the page, it conducts a symbolic page load to explore the possible behaviors of a web browser and a web server during the page load process, and it generates a list of pre-fetch urls for the client side to speed up page loads [30]. It is an extension of the ExpoSE concolic engine. SymJS is a framework for testing client-side JS scripts and mainly focuses on automatically discovering and exploring web events [31]. It modifies the Rhino JS engine for symbolic execution [27,19]. Kudzu targets AJAX applications and focuses on discovering code injection vulnerabilities by implementing a dynamic symbolic interpreter that takes a simplified intermediate language for JS [37]. To the best of our knowledge, there has been no publicly available symbolic execution engine targeting JS functions embedded in front-end web pages [32].

Another related approach to JS testing is fuzzing, which typically uses code coverage as feedback for test generation. There are a few fuzzers for JS, e.g., jsfuzz [11] and js-fuzz [10], which are largely based on the fuzzing logic of AFL (American fuzzy lop) [2], re-implemented for JS. We view fuzzing and symbolic/concolic testing as complementary techniques: fuzzing for broader exploration of JS, symbolic/concolic testing for deeper exploration.

# 7 Conclusions

We have presented a novel approach to applying concolic execution to front-end JS. The approach makes use of an in-situ concolic executor for JS and leverages the functionality of JS testing frameworks as test runners and web content extractors. Our approach works in three steps: (1) extracting JS functions from web pages using a JS testing framework; (2) integrating the in-situ concolic testing interface into the execution context of the JS web functions; (3) utilizing the testing framework's test runner and its mock data as the driver for concolic execution to generate additional test data for the JS web function under test.

We have conducted evaluations on open-source projects from GitHub and on Metamask's UI features, which are proper targets for our implementations on Puppeteer and Jest, respectively. We have found bugs in each evaluation, whose bug reports have been accepted on GitHub. This contributes to both bug finding and test suite improvement for the applications tested. The results show that our approach to concolic testing of front-end JS is both practical and effective.

Acknowledgements. This research received financial support in part from the National Science Foundation (Grant No. 1908571).

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Democratizing Quality-Based Machine Learning Development through Extended Feature Models?

Giordano d'Aloisio, Antinisca Di Marco, and Giovanni Stilo

> University of L'Aquila, L'Aquila, Italy giordano.daloisio@graduate.univaq.it {antinisca.dimarco,giovanni.stilo}@univaq.it

Abstract. ML systems have become an essential tool for experts of many domains, data scientists and researchers, allowing them to find answers to many complex business questions starting from raw datasets. Nevertheless, the development of ML systems able to satisfy the stakeholders' needs requires an appropriate amount of knowledge about the ML domain. Over the years, several solutions have been proposed to automate the development of ML systems. However, an approach taking into account the new quality concerns needed by ML systems (like fairness, interpretability, privacy, and others) is still missing.

In this paper, we propose a new engineering approach for the quality-based development of ML systems, realizing a workflow formalized as a Software Product Line through Extended Feature Models to generate an ML system satisfying the required quality constraints. The proposed approach leverages an experimental environment that applies all the settings to enhance a given Quality Attribute and selects the best one. The experimental environment is general and can be used for future evaluations of quality methods. Finally, we demonstrate the usefulness of our approach in the context of a multi-class classification problem and the fairness quality attribute.

Keywords: Machine Learning System · Software Quality · Feature Models · Software Product Line · Low-code development

## 1 Introduction

Machine Learning (ML) systems are increasingly widespread instruments, applied in all application domains and affecting our daily life. The development

<sup>*</sup> This work has been partially supported by the EMELIOT national research project, funded by the MUR under the PRIN 2020 program (Contract 2020W3A5FY), and by the European Union – Horizon 2020 Program under the scheme "INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities", Grant Agreement n. 871042, "SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics" (http://www.sobigdata.eu).

L. Lambers and S. Uchitel (Eds.): FASE 2023, LNCS 13991, pp. 88–110, 2023. https://doi.org/10.1007/978-3-031-30826-0_5

of ML systems usually requires good knowledge of the underlying ML approaches to choose the best techniques and models for the targeted problem. Many methods have been developed in recent years to automate some phases of ML system development and help non-technical users [61,31,34]. However, these techniques do not consider the quality properties essential for ML systems, such as the dataset's Privacy and the model's Interpretability, Explainability, and Fairness [50,46,12]. Indeed, considering the impact that ML applications have on our lives, it is clear that assuring these quality properties is of paramount importance (see, for instance, some of the 17 Sustainable Development Goals proposed by the United Nations [51]).

In this paper, we present MANILA (Model bAsed developmeNt of machIne Learning systems with quAlity), a novel approach that democratizes the quality-based development of ML systems by means of a low-code platform [62]. The goal of our approach is to provide an environment for the automatic configuration of experiments that automatically selects the ML system (i.e., ML algorithm and quality-enhancing method) best satisfying a given quality requirement. The requirement is satisfied by finding the best trade-off among the involved quality attributes. This simplifies the work of the data scientist and makes the quality-based development of ML systems accessible also to non-technical users (in other words, it democratizes it).

Hence, the main contributions of this paper are the following:


This paper is organized as follows: in Section 2 we discuss related work on the quality engineering of ML systems. In Section 3, we present the selected quality attributes and discuss how they affect ML systems. Section 4 is devoted to presenting a general workflow to choose the ML system best achieving the given quality attributes. This general workflow has been the motivating scenario for MANILA. In Section 5, we present MANILA by describing in detail the implemented ExtFM and explaining each step of the quality-based development of ML systems. Section 6 is dedicated to a proof of concept of the developed modelling framework by reproducing a case study. Section 7 describes some threats to validity, and finally, Section 8 presents some discussions, describes future work, and wraps up the paper.

## 2 Related Work

The problem of quality assurance in machine learning systems has gained much relevance in recent years. Many articles highlight the need to define and formalize new standard quality attributes for machine learning systems [30,65,70,50,12,46]. Most works in the literature focus either on identifying the most relevant quality attributes for ML systems or on formalizing them in the context of ML systems development.

Concerning the identification of quality attributes in ML systems, the authors of [40,72] identify three main components in which quality attributes can be found: Training Data, ML Models, and ML Platforms. The quality of Training Data is usually evaluated with properties such as privacy, bias, number of missing values, and expressiveness. By ML Model, the authors mean the trained model used by the system; the quality of this component is usually evaluated by fairness, explainability, interpretability, and security. Finally, the ML Platform is the implementation of the system, which is affected mostly by security, performance, reliability, and availability. Muccini et al. identify in [50] a set of quality properties as stakeholders' constraints and highlight the need to consider them during the Architecture Definition phase. The quality attributes include data quality, ethics, privacy, fairness, ML models' performance, etc. Martínez-Fernández et al. also highlight in [46] the need to formalize quality properties in ML systems and to update the software quality requirements defined by ISO 25000 [36]. The most relevant properties highlighted by the authors concern ML safety, ML ethics, and ML explainability. In our work, we focus on quality properties that arise during the development of ML systems, such as fairness, explainability, interpretability, and the dataset's privacy, while we leave quality properties (e.g., performance) that arise during other phases (e.g., deployment) for future work.

Many solutions have been proposed to formalize and model standard quality assurance processes in ML systems. Amershi et al. were the first to identify a set of common steps underlying every ML system's development [5]. In particular, each ML system is built in nine stages that go from data collection and cleaning, to model training and evaluation, and finally to the deployment and monitoring of the ML model. Their work has been the foundation of many subsequent papers on quality modelling of ML systems.

CRISP-ML (Cross-Industry Standard Process model for Machine Learning) is a process model proposed by Studer et al. [66], extending the better-known CRISP-DM [45] process model to ML systems. They identify a set of common phases for building ML systems, namely: Business and Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, and Monitoring and Maintenance. For each phase, the authors identify a set of functional quality properties to guarantee the quality of such systems. Similarly, the Quality for Artificial Intelligence (Q4AI) consortium proposed a set of guidelines [32] for the quality assurance of ML systems in specific domains: generative systems, operational data in process systems, voice user interface systems, autonomous driving, and AI OCR. For each domain, the authors identify a set of properties and metrics to ensure quality.

Concerning the modelling of quality requirements, Azimi et al. proposed a layered model for the quality assurance of machine learning systems in the context of the Internet of Things (IoT) [7]. The model is made of two layers: Source Data and ML Function/Model. For the Source Data, a set of quality attributes is defined: completeness, consistency, conformity, accuracy, integrity, and timeliness. Machine learning models are instead classified into predictors, estimators, and adapters, and a set of quality attributes is defined for each of them: accuracy, correctness, completeness, effectiveness, and optimality.
Each system is then influenced by a subset of quality characteristics based on the type of ML model and the required data. Ishikawa proposed, instead, a framework for the quality evaluation of ML systems [35]. The framework defines these components for ML applications: dataset, algorithm, ML component, and system, and, for each of them, proposes an argumentation approach to assess quality. Finally, Siebert et al. [64] proposed a formal modelling definition for quality requirements in ML systems. They start from the process definition in [45] and build a meta-model for the description of quality requirements. The meta-model is made of the following classes: Entity (which can be defined at various levels of abstraction, such as the whole system or a specific component of it), Property (also expressed at different levels of abstraction), and the Evaluation and Measure related to the property. Starting from this meta-model, the authors build a tree model to evaluate the quality of the different components of the system.

From this analysis, we can conclude that there is a robust research motivation for formalizing and defining new quality attributes for ML systems. Many attempts have been made to address these issues, and several quality properties, metrics, and definitions for ML systems can now be extracted from the literature. However, a framework that actually guides the data scientist through the development of ML systems satisfying quality properties is still missing. In this paper, we aim to address these concerns by proposing MANILA, a novel approach that democratizes the quality-based development of ML systems by means of a low-code platform. In particular, we model a general workflow for the quality-based development of ML systems as an SPL through the ExtFM formalism.
Next, we demonstrate how it is possible to generate an actual implementation of such a workflow from a low-code experiment configuration and how this workflow is actually able to find the best methods to satisfy a given quality requirement. Recalling the ML development process of [5], MANILA focuses on the model training and model evaluation development steps by guiding the data scientist in selecting the ML system (i.e., ML algorithm and quality-enhancing method) best satisfying a given quality attribute.

Concerning the adoption of Feature Models to model ML systems, a similar approach has been used by Di Sipio et al. in [24]. In their work, the authors use Feature Models to model ML pipelines for Recommender Systems. The variation points are identified by all the components needed to implement a recommender system (e.g., the ML algorithm to use or the python libraries for the implementation). However, they do not consider quality attributes in their approach.

Finally, concerning the assessment of quality attributes in ML systems, there is intense research activity primarily related to the fairness-testing domain [20]. In general, the problem of fairness assurance can be defined as a search-based problem among different ML algorithms and fairness methods [20]. Many tools have been proposed for automated fairness testing, such as [18,63,69], to cite a few. However, these tools tend to require programming skills and are thus unfriendly to non-technical stakeholders [20]. In our work, we aim to fill this gap by proposing a low-code framework that, by generating and executing suitable experiments, supports users (including non-experts) in the quality-based development of ML systems, returning the trained ML model with the best quality.

## 3 Considered Quality Attributes

In software engineering, a quality requirement specifies criteria that can be used to quantify or qualify the operation of a system rather than to specify its behaviours [19]. To analyse an ML system from a qualitative perspective, we must determine the Quality Attributes (QA) that we can use to judge the system's operation and that influence the ML designers' decisions. We refer to the ML systems literature to identify the QA to consider [46,50,30,40,70]. In this work, we consider a subset of the identified QA, i.e., Effectiveness, Fairness, Interpretability, Explainability, and Privacy.

Effectiveness. This QA defines how good the model must be at predicting outcomes [13]. There are different metrics in the literature to address the Effectiveness of an ML model. Among the most common, we cite Precision: the fraction of true positives (TP) over the total positive predictions [14]; Recall: the fraction of TP over the total positive items in the dataset [14]; F1 Score: the harmonic mean of Precision and Recall [67]; and Accuracy: the fraction of TP and true negatives (TN) over the total of predictions [60]. This attribute is crucial in developing an ML system and must always be accounted for in the quality evaluation of ML systems [72,13].
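These definitions translate directly into arithmetic on confusion-matrix counts. The helpers below are an illustrative sketch for the binary case (the function names are ours; scikit-learn offers equivalent implementations such as `sklearn.metrics.f1_score`):

```python
# Illustrative effectiveness metrics from raw confusion-matrix counts
# (binary classification): tp/tn = true positives/negatives,
# fp/fn = false positives/negatives.

def precision(tp, fp):
    # fraction of true positives over all positive predictions
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of true positives over all actual positives
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    # fraction of correct predictions over all predictions
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, with 8 TP, 2 FP, 2 FN, and 8 TN, all four metrics evaluate to 0.8.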

Fairness. An ML model can be defined as fair if it has no prejudice or favouritism towards an individual or a group based on their inherent or acquired characteristics, identified by so-called sensitive variables [47]. Sensitive variables are variables of the dataset that can cause prejudice or favouritism towards individuals having a particular value of that variable (e.g., sex is a very common sensitive variable, and women can be identified as the unprivileged group [47,16,42]). Several metrics can assess the discrimination of an ML system towards sensitive groups (group-fairness metrics) or single individuals (individual-fairness metrics) [47,16].
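As an illustration of a group-fairness metric, the sketch below computes the statistical parity difference: the gap between the favourable-outcome rates of the unprivileged and privileged groups (0 means parity). The function name and interface are ours; libraries such as AIF360 provide equivalent metrics:

```python
# Illustrative group-fairness metric: statistical parity difference.
# A negative value means the unprivileged group receives the favourable
# outcome less often than the privileged group.

def statistical_parity_difference(predictions, groups, unprivileged):
    """predictions: 0/1 outcomes (1 = favourable);
    groups: group labels aligned with predictions;
    unprivileged: label of the unprivileged group."""
    unpriv = [p for p, g in zip(predictions, groups) if g == unprivileged]
    priv = [p for p, g in zip(predictions, groups) if g != unprivileged]
    rate = lambda xs: sum(xs) / len(xs)
    return rate(unpriv) - rate(priv)
```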

Interpretability. Interpretability can be defined as the ability of a system to enable user-driven explanations of how a model reaches the produced conclusion [15]. Interpretability is one QA that can be estimated without executing an actual ML system. Indeed, ML methods are classified as white-box, i.e., interpretable (e.g., Decision Trees or linear models), and black-box, i.e., not interpretable (e.g., Neural Networks) [49]. Interpretability is a very strong property that can hold only for white-box approaches. Black-box methods, instead, require the addition of explainability-enhancing methods to make their results interpretable [43].

Explainability. Explainability can be defined as the ability to make black-box methods' results (which are not interpretable) interpretable [43]. Enhancing the interpretability of black-box methods has become crucial to guarantee the trustworthiness of ML systems, and several methods have been implemented for this purpose [43]. The quality of explanations can be measured with several metrics that can be categorised as application-grounded metrics, which involve an evaluation of the explanations with end users; human-grounded metrics, which involve evaluations of explanations with non-domain experts; and functionally-grounded metrics, which use proxies based on a formal definition of interpretability [73].

Privacy. Privacy can be defined as the susceptibility of data or datasets to revealing private information [21]. Several metrics can assess the ability to link personal data to an individual, the level of detail or correctness of sensitive information, the background information needed to determine private information, etc. [71].

## 4 Motivating Scenario

Today, a data scientist, required to realize an ML system satisfying a given quality constraint, has no automatic support in the development process. Indeed, she follows and manually executes a general experiment workflow aiming at evaluating a set of ML systems obtained by assembling quality assessment and improvement algorithms with the ones solving the specific ML tasks. By running the defined experiment, she aims to find the optimal solution satisfying a given QA constraint.

Algorithm 1 reports the pseudo-code of a generic experiment to assess a generic QA during the development of an ML system. This code has been derived from our previous experience in the quality-based development of ML Systems and by asking researchers studying ML development and quality assessment how they evaluate such properties during ML systems development.

The first step in the experiment workflow is selecting the dataset to use (in this work, we assume that the dataset has already been preprocessed and is ready to train the ML model). Next, the data scientist selects the ML algorithms, the methods enhancing a QA, and the appropriate quality metrics for the evaluation. Then, for each of the chosen ML algorithms, she applies the selected quality methods according to their type; there can be the following options:

### Algorithm 1: Quality-evaluation experiment pseudo-code



Finally, the data scientist computes the selected metrics for the specific pair of ML and QA methods. After repeating the process for all the selected methods, she chooses a report technique (e.g., table or chart), evaluates the obtained results collected in the report, and trains, on the entire dataset, the best-performing ML algorithm by applying the quality method that best achieves the QA. If the data scientist has a threshold to achieve, she can verify whether at least one of the combinations of ML and quality methods satisfies the constraint. If so, one of the suitable pairs is selected. Otherwise, she has to relax the threshold and repeat the process.
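The loop described above can be sketched as follows. All names (`apply`, `compute`, the report structure) are hypothetical placeholders for the data scientist's actual choices, not MANILA code:

```python
# Minimal sketch of the generic quality-evaluation experiment, assuming
# each quality method exposes apply(algorithm, dataset) and each metric
# exposes compute(model, dataset) (hypothetical interfaces).

def run_experiment(dataset, ml_algorithms, quality_methods, metrics):
    report = {}
    for algo in ml_algorithms:
        for method in quality_methods:
            model = method.apply(algo, dataset)          # pre/in/post-processing
            scores = {m.name: m.compute(model, dataset)  # selected metrics
                      for m in metrics}
            report[(algo.name, method.name)] = scores
    return report

def best_pair(report, metric_name, threshold=None):
    # pick the (algorithm, method) pair maximising the chosen metric;
    # if a threshold is given and no pair reaches it, return None so the
    # data scientist can relax the constraint and repeat the process
    pair, scores = max(report.items(), key=lambda kv: kv[1][metric_name])
    if threshold is not None and scores[metric_name] < threshold:
        return None
    return pair
```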

The workflow described in Algorithm 1 can be generalized as a process of common steps describing any experiment in the considered domain. Figure 1 sketches such a generalization. First, the data scientist selects all the features of the experiment, i.e., the dataset, the ML methods, the methods assuring a specific QA, and the related metrics. We call this step Features Selection. Next, she runs the quality methods using the general approach described in Algorithm 1 and evaluates the results (namely, Experiment Execution). If the results are

Fig. 1: Manual execution of the quality experiment workflow

satisfying (i.e., they satisfy the quality constraints), then the method with the best QA is returned. Otherwise, the data scientist has to repeat the process.

The described workflow is the foundation of MANILA, which aims to formalise and democratise it by providing an SPL- and ExtFM-based low-code framework that supports the data scientist in the development of quality ML systems.

## 5 MANILA Approach

In this section, we describe MANILA, a framework to formalise and democratise the quality-based development of ML systems. This work is based on the quality properties and the experiment workflow described in sections 3 and 4, respectively.

Our approach aims to automate and ease the quality-based development of ML systems. We achieve this goal by proposing a framework that automatically generates the configuration of an experiment to find the ML system (i.e., ML algorithm and quality-enhancing method) best satisfying a given QA. This framework accelerates the quality-based development of ML systems, making it accessible to non-experts as well.

Recalling the experimental workflow described in section 4, the set of ML models, quality methods, and metrics can be considered variation points of each experiment, differentiating one experiment from another. For this reason, we can think of this family of experiments as a Software Product Line (SPL) specified by a Feature Model [6]. Indeed, Feature Models allow us to define a template for families of software products with standard features (i.e., components of the final system) and a set of variability points that differentiate the final systems [38,29]. Features in the model follow a tree-like parent-child relationship and can be mandatory or optional [29]. Sibling features can belong to an Or-relationship or an Alternative-relationship [29]. Finally, there can be cross-tree relationships among features not in the same branch; these relationships are expressed using logical propositions [29]. However, traditional Feature Models do not allow associating attributes to features, which are necessary in our case to represent a proper experiment workflow (for instance, to specify the label of the dataset or the number of rounds in a cross-validation [58]). Hence, we relied on the concept of Extended Feature Models [38,9] to represent the family of experiment workflows.
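The ExtFM concepts mentioned above (mandatory features, attributes, Alternative groups) can be illustrated with a small, hypothetical data structure; this mirrors the ideas, not FeatureIDE's actual representation:

```python
# Illustrative sketch of Extended Feature Model concepts: features with
# attributes (the "extended" part), child features, and group semantics.
# Names and structure are ours, not FeatureIDE's format.

from dataclasses import dataclass, field

@dataclass
class Feature:
    name: str
    mandatory: bool = False
    attributes: dict = field(default_factory=dict)   # ExtFM extension
    children: list = field(default_factory=list)
    group: str = ""   # "or", "alternative", or "" for plain children

# e.g., a Label feature with two attributes and an Alternative group
label = Feature("Label", mandatory=True,
                attributes={"Name": "y", "Positive value": 2},
                children=[Feature("Binary"), Feature("MultiClass")],
                group="alternative")

def valid_alternative(selection, feature):
    # an Alternative group admits exactly one selected child
    return sum(c.name in selection for c in feature.children) == 1
```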

Fig. 2: MANILA approach

Figure 2 details a high-level picture of MANILA, where each rounded box represents a step in the quality-driven development process, while square boxes represent artefacts. Dotted blocks represent steps that have not been implemented yet and will be considered in future work.

The basis of MANILA is the Extended Feature Model (ExtFM), based on the existing ExtFM Meta-Model. The ExtFM is the template of all possible experiments a data scientist can perform and guides her through the quality-based development of an ML system. The first step in the development process is the features selection, in which the data scientist selects all the components of the quality-testing experiment. Next, a Python script implementing the experiment is automatically generated from the selected features. Finally, the experiment is executed, and for each selected QA, it returns:


In the future, MANILA will analyse the quality reports of each selected QA in order to find the best trade-off among them (for instance, by means of Pareto-front functions). The architecture of MANILA makes it easy to extend. In fact, adding a new method or metric to MANILA simply translates to adding a new feature to the ExtFM and the proper code implementing it.
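Such a trade-off analysis could, for instance, keep only the Pareto-optimal configurations. A minimal sketch, assuming every QA is scored so that higher is better (this is our illustration of the idea, not MANILA code):

```python
# Minimal Pareto-front sketch over per-method QA score tuples, assuming
# higher is better for every quality attribute. Illustrative only.

def dominates(a, b):
    # a dominates b if it is at least as good everywhere, better somewhere
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """candidates: dict mapping method name -> tuple of QA scores."""
    return {name for name, score in candidates.items()
            if not any(dominates(other, score)
                       for o, other in candidates.items() if o != name)}
```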

Near each step, we report the tools involved in its implementation. The source code of the implemented artefacts is available on Zenodo [23], and GitHub [22]. In the following, we detail the ExtFM and each process step.

### 5.1 Extended Feature Model

As already mentioned, the ExtFM is the basis of the MANILA approach since it defines the template of all possible experiments a data scientist can generate. It has been implemented using FeatureIDE, an open-source graphical editor which allows the definition of ExtFMs [68]. Figure 3 shows a short version of the implemented ExtFM<sup>1</sup>.

Fig. 3: Short version of the implemented Extended Feature Model

In particular, each experiment is defined by seven macro features, which are then detailed by child features.

The first mandatory feature is the Dataset. The Dataset has a file extension (e.g., CSV, EXCEL, JSON, and others) and a Label, which can be Binary or Multi-Class. The Label feature has two attributes specifying its name and

<sup>1</sup> The whole picture can be downloaded here: https://anonymous.4open.science/r/manila-101D/imgs/feature-model.png

the positive value (used to compute fairness metrics). The Dataset can also have one or more sensitive variables that identify sensitive groups subject to unfairness [47]. The sensitive variables have a set of attributes to specify their name and the privileged and unprivileged groups [47]. Finally, there is a feature to specify whether the Dataset has only positive attributes. This feature has been included to define a cross-tree constraint with a scaler technique that requires only positive attributes (see table 1). All these features are modelled as abstract since they do not have a concrete implementation in the final experiment.

The next feature is a Scaler algorithm, which is not mandatory and can be included in the experiment to scale and normalize the data before training the ML model [54]. Different scaler algorithms from the scikit-learn library [55] are listed as concrete children of this feature.

Next, there is the macro feature representing the ML Task to perform. This feature has not been modelled as mandatory since two fairness methods (i.e., Gerry Fair and Meta Fair [39,17]) embed a fair classification algorithm, so, if either of them is selected, the ML Task cannot be specified. However, we included a cross-tree constraint requiring the selection of ML Task if neither of these two methods is selected (¬ Gerry Fair ∧ ¬ Meta Fair ⇒ ML Task). An ML Task can be Supervised or Unsupervised. A Supervised task can be a Classification or a Regression task and has an attribute to specify the size of the training set. These two abstract features are then detailed by a set of concrete implementations of ML methods selected from the scikit-learn library [55]. An Unsupervised learning task can be a Clustering or an Aggregation task. At this stage of the work, these two features have not been detailed and will be explored in future work.

Next is the macro feature representing the system's Quality Attributes.
This feature is detailed by the four quality attributes described in section 3. Effectiveness is not included among these features since it is an implicit quality of the ML methods and does not require adding other components (i.e., algorithms) to the experiment. At the time of writing, the Fairness quality has been detailed, while the other properties will be deepened in future work. In particular, Fairness methods can be Pre-Processing (i.e., strategies that try to mitigate the bias in the dataset used to train the ML model [47,37,27]), In-Processing (i.e., methods that modify the behaviour of the ML model to improve fairness [47,3]), and Post-Processing (i.e., methods that re-calibrate an already trained ML model to remove bias [47,56]). These three features are detailed by several concrete features representing fairness-enhancing methods. In selecting such algorithms, we chose methods with a solid implementation, i.e., algorithms integrated into libraries such as AIF360 [8] or Fairlearn [11], or algorithms with a stable source code such as DEMV [26] or Blackbox [56]. All these quality features have been implemented with an Or-group relationship.

Next comes the macro feature representing the Metrics to use in the experiment. Metrics are divided among Classification Metrics, Regression Metrics, and Fairness Metrics. Each metric category has a set of concrete metrics selected from the scikit-learn library [55] and the AIF360 library [8]. Based on the ML Task and the Quality Attributes selected, the data scientist must select the proper metrics to assess Effectiveness and the other Quality Attributes. This constraint is formalized by cross-tree relationships among features (see table 1). In addition, an Aggregation Function must be selected if more than one metric is selected. The aggregation function combines the values of the other metrics to give an overall view of the method's behaviour.
Next, there is the optional macro feature identifying the Validation function. Validation functions are different strategies to evaluate the Quality Attributes of an ML model [57]. Several validation functions are available as child features, and there is an attribute to specify the number of groups in case of cross-validation [57].

The last macro feature is related to the presentation of the results. Recalling the experiment workflow described in section 4, the results are the metrics' values derived from the execution of the experiment. The results can be presented in a tabular way or using proper charts; different chart types are available as concrete child features. Finally, table 1 lists the cross-tree constraints defined



in our model. These constraints are useful to guide the data scientist through selecting proper fairness-enhancing methods or metrics based on the Dataset's characteristics (i.e., label type or the number of sensitive variables) or the ML Task.

### 5.2 Features Selection

From the depicted ExtFM, the data scientist can define her experiment by specifying the needed features inside a configuration file. A configuration file is an XML file describing the set of selected features and the possible attribute values. The constraints among features defined in the ExtFM guide the data scientist in the selection by disallowing the selection of features that conflict with already selected ones. The editor used to implement the ExtFM [68] provides a GUI for the specification of configuration files, making this process accessible to non-technical users.

Fig. 4: Feature selection and attribute specification process

Figure 4 depicts how the feature selection and attribute specification processes are done in MANILA. In particular, figure 4a details how the features of the Dataset are selected inside the configuration. Note how features that conflict with already selected ones are automatically disabled by the system (e.g., the Binary feature is disabled since the MultiClass feature is selected). This automatic pruning of the ExtFM guides the data scientist in defining configurations that always lead to valid (i.e., executable) experiments. Figure 4b details how attributes can be specified during the definition of the configuration. In particular, the rightmost column in figure 4b displays the attribute value specified by the data scientist (e.g., the name of the label is y, and the positive value is 2). During the experiment generation step, a process automatically checks whether all the required attributes (e.g., the label name) have been defined; otherwise, it asks the data scientist to fill them in.

### 5.3 Experiment Generation

From the XML file describing an experiment configuration, it is possible to generate a Python script implementing the defined experiment.

```
<feature automatic="selected" manual="undefined" name="Dataset"/>
<feature automatic="selected" manual="undefined" name="Label">
    <attribute name="Positive value" value="2"/>
    <attribute name="Name" value="contr_use"/>
</feature>
<feature automatic="unselected" manual="undefined" name="Binary"/>
<feature automatic="undefined" manual="selected" name="MultiClass"/>
```

Listing 1.1: Portion of configuration file

Listing 1.1 shows a portion of the configuration file derived from the feature selection process. In particular, it can be seen how the Dataset and Label features have been automatically selected by the system (features with name="Dataset" and name="Label" and automatic="selected"), the MultiClass feature has been manually selected by the data scientist (feature with name="MultiClass" and manual="selected"), and the Binary feature was not selected (feature with name="Binary" and neither automatic nor manual selected). In addition, the name and the value of two Label attributes (i.e., Positive value equal to 2 and Name equal to contr_use) are reported.

The structure of the configuration file makes it easy to parse with a proper script. In MANILA, we implemented a Python parser that reads the configuration file given as input and generates a set of scripts implementing the defined experiment. The parser can be invoked using the Python interpreter with the command shown in listing 1.2.

```
$ python generator.py -n <CONFIGURATION FILE PATH>
```

Listing 1.2: Python parser invocation

In particular, the parser first checks whether all the required attributes (e.g., the label's name) are set. If some of them are not set, it asks the data scientist to fill them in before continuing the parsing. Otherwise, it selects all the features with automatic="selected" or manual="selected" and uses them to fill a Jinja2 template [53]. The generated quality-evaluation experiment follows the same structure as algorithm 1. It is embedded inside a Python function that takes as input the dataset to use (listing 1.3). An example of a generated file can be accessed in the GitHub [22] or Zenodo [23] repository.
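Reading such a configuration file is straightforward with Python's standard library; the sketch below extracts the selected features and their attributes (it is illustrative and omits the attribute-completeness check and the Jinja2 rendering of the real parser):

```python
# Sketch of extracting selected features and attribute values from a
# FeatureIDE-style configuration file, using only the standard library.

import xml.etree.ElementTree as ET

def selected_features(xml_text):
    root = ET.fromstring(xml_text)
    selected, attributes = [], {}
    for feat in root.iter("feature"):
        # a feature counts as selected if either flag says "selected"
        if "selected" in (feat.get("automatic"), feat.get("manual")):
            name = feat.get("name")
            selected.append(name)
            for attr in feat.findall("attribute"):
                attributes[(name, attr.get("name"))] = attr.get("value")
    return selected, attributes
```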

```
def experiment(data):
    # quality evaluation experiment
```
Listing 1.3: Quality-testing experiment signature

In addition to the main file, MANILA also generates a set of Python files needed to execute the experiment and an environment.yml file containing the specification of the conda [1] environment needed to perform the experiment. All the files are generated inside a folder named gen.

### 5.4 Experiment Execution

The generated experiment can be invoked directly through the Python interpreter using the command given in listing 1.4. Otherwise, it can be called through a REST API or any other interface, such as a desktop application or a Scientific Workflow Management System like KNIME [44,10]. This generality makes our experimental workflow very flexible and suitable for many use cases.

```
$ python main.py -d <DATASET PATH>
```

Listing 1.4: Experiment invocation

The experiment applies each ML algorithm with each quality method and returns a report using the selected metrics, along with the method achieving the best QA. It is worth noting that each quality method is evaluated individually on the selected ML algorithm and that, for each QA, a corresponding report is returned by the system. Figure 5 reports an example of how the quality

Fig. 5: Quality evaluation process example

evaluation process works in MANILA. In this example, the data scientist has selected three ML algorithms and wants to assure Fairness and Explainability. She has selected n methods to assure Fairness and m methods to assure Explainability. In addition, she has selected j metrics for Fairness and k metrics for Explainability. The testing process then performs two parallel sets of experiments. In the first, it applies the n Fairness methods to each ML algorithm and computes the j Fairness metrics. In the second, it applies the m Explainability methods to the ML algorithms and computes the k Explainability metrics. Finally, the process returns two reports synthesising the obtained results for Fairness and Explainability, along with the ML algorithms achieving the best Fairness and Explainability, respectively. If the data scientist chooses to see the results in tabular form (i.e., selects the Tabular feature in the ExtFM), the results are saved in a CSV file. Otherwise, the charts displaying the results are saved as PNG files. The ML algorithm returned by the experiment is saved as a pickle file [2]. We have chosen this format since it is the standard format for storing serialized objects in Python and can be easily imported in other scripts.
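The per-QA evaluation loop just described can be sketched as follows. The callable interfaces for quality methods and metrics, the mean-based aggregation, and the output filename are assumptions for illustration, not MANILA's actual API.

```python
# Sketch of the per-quality-attribute evaluation loop: apply each quality
# method to each ML algorithm, compute the metrics, and keep the model
# with the best aggregated score. Interfaces and filename are assumptions.
import pickle

def run_qa_experiment(algorithms, qa_methods, metrics, data):
    report, best, best_score = [], None, float("-inf")
    for algo in algorithms:
        for method in qa_methods:
            model = method(algo, data)  # train `algo` with this quality method
            scores = {m.__name__: m(model, data) for m in metrics}
            aggregated = sum(scores.values()) / len(scores)  # stand-in aggregation
            report.append((algo.__name__, method.__name__, scores))
            if aggregated > best_score:
                best, best_score = model, aggregated
    # the best model is serialized as a pickle file, as in MANILA
    with open("best_model.pkl", "wb") as f:
        pickle.dump(best, f)
    return report, best
```

With three algorithms and n fairness methods, the loop yields 3 × n report rows, mirroring the parallel experiment sets of Figure 5.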

Finally, it is worth noting that the generated experiment workflow is written in Python and can be customised to address particular stakeholders' needs or to evaluate other quality methods.

### 6 Proof of Concept

To demonstrate MANILA's ability to support the quality-based development of ML systems, we used it to implement a fair classification system predicting the frequency of contraceptive use by women, based on a well-known dataset from the fairness literature [42]. This use case is relevant since fairness has gained much importance in recent years, partly because of the UN sustainable development goals [51]. The first step in the quality development process is feature

Fig. 6: Dataset specification

selection. The ML task to solve is a multi-class classification problem [4]; hence, in the ExtFM we selected the MultiClass feature for the Label and specified its name and the positive value to consider for the fairness evaluation (long-term use). We used a CSV dataset file, so we specified this feature in the configuration. Finally, in accordance with the literature [42], we specified that the dataset has multiple sensitive variables to consider for fairness, together with their names and privileged and unprivileged values. Figure 6 reports the selected features of the Dataset and the specified attributes.

Next, we specified that we want to use a Standard Scaler algorithm to normalize the data, and we selected the following ML algorithms for classification: Logistic Regression [48], Support Vector Classifier [52], and Gradient Boosting Classifier [28]. Figure 7 reports the Fairness methods we want to test. Note that many methods have been automatically disabled by the system based on the features already selected<sup>2</sup>. Further, we specified the metrics to evaluate Fairness and Effectiveness: Accuracy [60], Zero One Loss [25], Disparate Impact [27], Statistical Parity [41], and Equalized Odds [33], with the Harmonic Mean as aggregation function (chosen because it is widely used in the literature). Finally, we specified that we want to perform a 10-fold cross-validation [59] and that we want the results in tabular form without the

<sup>2</sup> In particular, these methods have been disabled because they do not support multi-class classification or multiple sensitive variables.

Fig. 7: Selected Fairness methods

generation of a chart. From the given configuration, MANILA generates all the Python files needed to run the quality-assessment experiment. In particular, the generated experiment trains and tests all the selected ML algorithms (i.e., Logistic Regression, Support Vector Classifier, and Gradient Boosting Classifier), applying all the selected fairness methods (i.e., DEMV, Exponentiated Gradient, and Grid Search). Finally, it computes the selected metrics on the trained ML algorithms and returns a report of the metrics along with the fully trained ML algorithm achieving the best fairness. All the generated files are available on Zenodo [23] and GitHub [22].

Table 2: Generated results

The generated experiment was executed directly from the Python interpreter, and the obtained results are reported in Table 2. The table lists the fairness-enhancing methods, the ML algorithms, and all the computed metrics, automatically sorted by the given aggregation function (i.e., the rightmost column, HMean). From the results, we can see that the Support Vector Classifier (svc in the table) combined with the DEMV fairness method achieves the best Fairness-Effectiveness trade-off, since it has the highest HMean value (highlighted in green in Table 2). Hence, the ML algorithm returned by the experiment is the Support Vector Classifier, trained on the full dataset after the application of the DEMV algorithm.
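The harmonic-mean ranking used in the report can be sketched as follows; the column names and metric values are illustrative, not the actual results of Table 2.

```python
# Sketch of the harmonic-mean aggregation and ranking used in the report.
# Column names and metric values are illustrative placeholders.
from statistics import harmonic_mean

rows = [
    {"method": "demv", "model": "svc",    "accuracy": 0.55, "disparate_impact": 0.95},
    {"method": "grid", "model": "logreg", "accuracy": 0.60, "disparate_impact": 0.70},
]

# aggregate each row's metrics into a single HMean column
for row in rows:
    row["hmean"] = harmonic_mean([row["accuracy"], row["disparate_impact"]])

# highest harmonic mean first, as in the generated report
rows.sort(key=lambda r: r["hmean"], reverse=True)
```

The harmonic mean penalizes rows where one metric is much lower than the others, which is why it is a common choice for balancing effectiveness against fairness.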

### 7 Threats to Validity

Although the QAs considered in MANILA are the most relevant and most cited in the literature, there may be other QAs that strongly affect the environment or end users of an ML system but are not prominently discussed in existing papers. In addition, the proposed experimental workflow is based on the considered QAs; other QAs not considered at the time of writing might have to be evaluated differently.

### 8 Conclusion and Future Work

In this paper, we have presented MANILA, a novel approach to democratize the quality-based development of ML systems. First, we identified the most influential quality properties of ML systems by selecting the quality attributes most cited in the literature. Next, we presented a general workflow for the quality-based development of ML systems. Finally, we described MANILA in detail: we first explained how the general workflow can be formalized through an ExtFM, and then detailed all the steps required to develop a quality ML system with MANILA. We started from the low-code configuration of the experiment to perform and described how a Python implementation can be generated from such a configuration. Finally, we showed how executing the experiment identifies the method that best satisfies a given quality requirement. We demonstrated MANILA's ability to guide data scientists through the quality-based development of ML systems by implementing a fair multi-class classification system to predict the use of contraceptive methods by women.

In future work, we plan to improve MANILA by extending the ExtFM with additional methods enhancing other quality attributes, and by implementing in the framework the trade-off analysis that combines the different quality-attribute evaluations, when required, by means of Pareto-front functions. MANILA appears to be easy to use and very general, able to embed different quality attributes that can be quantitatively measured. To validate this intuition, we will conduct a user evaluation of MANILA involving both experts and non-experts in quality-based ML system development. The groups we aim to involve are: master's students in computer science and applied data science (i.e., non-expert users), data scientists working in industry, and researchers studying ML development and quality assessment (i.e., expert users). In addition, since MANILA configures an experiment by running all possible combinations of the selected features, a limitation of the proposed approach is its complexity and the time needed to obtain results. This limitation is mitigated by the feature selection step, which requires the user to choose which features to include in the experiment. As future work, to broaden MANILA's adoption, we will study these aspects further and provide users with guidelines on how to mitigate such potential limitations.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Efficient Bounded Exhaustive Input Generation from Program APIs

Mariano Politano<sup>1,4</sup>(), Valeria Bengolea<sup>1</sup>, Facundo Molina<sup>3</sup>, Nazareno Aguirre<sup>1,4</sup>, Marcelo F. Frias<sup>2,4</sup>, and Pablo Ponzio<sup>1,4</sup>

<sup>1</sup> Universidad Nacional de Río Cuarto, Río Cuarto, Argentina mpolitano@dc.exa.unrc.edu.ar

<sup>2</sup> Instituto Tecnológico de Buenos Aires, Buenos Aires, Argentina
<sup>3</sup> IMDEA Software Institute, Madrid, Spain
<sup>4</sup> CONICET, Buenos Aires, Argentina

Abstract. Bounded exhaustive input generation (BEG) is an effective approach to reveal software faults. However, existing BEG approaches require a precise specification of the valid inputs, i.e., a repOK, that must be provided by the user. Writing repOKs for BEG is challenging and time consuming, and they are seldom available in software.

In this paper, we introduce BEAPI, an efficient approach that employs routines from the API of the software under test to perform BEG. Like API-based test generation approaches, BEAPI creates sequences of calls to methods from the API, and executes them to generate inputs. As opposed to existing BEG approaches, BEAPI does not require a repOK to be provided by the user. To make BEG from the API feasible, BEAPI implements three key pruning techniques: (i) discarding test sequences whose execution produces exceptions violating API usage rules, (ii) state matching to discard test sequences that produce inputs already created by previously explored test sequences, and (iii) the automated identification and use of a subset of methods from the API, called builders, that is sufficient to perform BEG.

Our experimental assessment shows that BEAPI's efficiency and scalability is competitive with existing BEG approaches, without the need for repOKs. We also show that BEAPI can assist the user in finding flaws in repOKs, by (automatically) comparing inputs generated by BEAPI with those generated from a repOK. Using this approach, we revealed several errors in repOKs taken from the assessment of related tools, demonstrating the difficulties of writing precise repOKs for BEG.

### 1 Introduction

Automated test generation approaches aim at assisting developers in crucial software testing tasks [2,22], like automatically generating test cases or suites [6,18,10], and automatically finding and reporting failures [23,19,12,20,4,13]. Many of these approaches involve random components that forgo a systematic exploration of the space of behaviors but improve test generation efficiency [23,19,10]. While these approaches have been useful in finding a large number of bugs in software, they might miss certain faulty software behaviors due to their random nature. Alternative approaches aim at systematically exploring a very large number of executions of the software under test (SUT), with the goal of providing stronger guarantees about the absence of bugs [20,4,12,14,6,18]. Some of these approaches are based on bounded exhaustive generation (BEG) [20,4], which consists of generating all feasible inputs that can be constructed using bounded data domains. Common targets of BEG approaches are implementations of complex dynamic data structures with rich structural constraints (e.g., linked lists, trees, etc.). The most widely used and efficient BEG approaches for testing software [20,4] require the user to provide a formal specification of the constraints that the inputs must satisfy (often a representation invariant of the input, or repOK) and bounds on data domains [20,4], often called scopes. Thus, specification-based BEG approaches yield all inputs within the provided scopes that satisfy repOK.

Writing appropriate formal specifications for BEG is a challenging and time consuming task. The specifications must precisely capture the intended constraints of the inputs. Overconstrained specifications lead to missing the generation of valid inputs, which might make the subsequent testing stage miss the exploration of faulty behaviors of the SUT. Underconstrained specifications may lead to the generation of invalid inputs, which might produce false alarms while testing the SUT. Furthermore, sometimes the user needs to take into account the way the generation approach operates, and write the specifications in a very specific way for the approach to achieve good performance [4] (see Section 4). Finally, such precise formal specifications are seldom available in software, hindering the usability of specification-based BEG approaches.

Several studies show that BEG approaches are effective in revealing software failures [20,16,4,33]. Furthermore, the small scope hypothesis [3], which states that most software faults can be revealed by executing the SUT on "small inputs", suggests that BEG approaches should discover most (if not all) faults in the SUT, if large enough scopes are used. The challenge that BEG approaches face is how to efficiently explore a huge search space, that often grows exponentially with respect to the scope. The search space often includes a very large number of invalid (not satisfying repOK) and isomorphic inputs [15,28]. Thus, pruning parts of the search space involving invalid and redundant inputs is key to make BEG approaches scale up in practice [4].

In this paper, we propose a new approach for BEG, called BEAPI, that works by making calls to API methods of the SUT. Similarly to API-based test generation approaches [23,19,10], BEAPI generates sequences of calls to methods from the API (i.e., test sequences). The execution of each test sequence yielded by BEAPI generates an input in the resulting BEG set of objects. As usual in BEG, BEAPI requires the user to provide scopes for generation, which for BEAPI include a maximum test sequence length. Brute-force BEG from a user-provided scope would attempt to generate all feasible test sequences of methods from the API up to the maximum sequence length. This is an intrinsically combinatorial process that exhausts computational resources before completion, even for very small scopes (see Section 4). We propose several pruning techniques that are crucial for the efficiency of BEAPI and allow it to scale up to significantly larger scopes. First, BEAPI executes test sequences and discards those that correspond to violations of API usage rules (e.g., throwing exceptions that indicate incorrect API usage, such as IllegalArgumentException in Java [17,23]). Thus, as opposed to specification-based BEG approaches, BEAPI does not require a repOK that precisely describes valid inputs. Instead, BEAPI requires minimal specification effort in most cases (including most of our case studies in Section 4), which consists of making API methods throw exceptions on invalid inputs (in the "defensive programming" style popularized by Liskov [17]). Second, BEAPI implements state matching [15,28,36] to discard test sequences that produce inputs already created by previously explored sequences. Third, BEAPI employs only a subset of the API methods to create test sequences: a set of methods automatically identified as builders [27].
Before test generation, BEAPI executes an automated builders identification approach [27] to find a smaller subset of the API that is sufficient to yield the resulting BEG set of inputs. Another advantage of BEAPI with respect to specification-based approaches is that it produces test sequences to create the corresponding inputs using methods from the API, making it easier to create tests from BEAPI's output [5].

We experimentally assess BEAPI, and show that its efficiency and scalability are comparable to those of the fastest BEG approach (Korat), without the need for repOKs. We also show that BEAPI can be of help in finding flaws in repOKs, by comparing the sets of inputs generated by BEAPI using the API against the sets of inputs generated by Korat from a repOK. Using this procedure, we found several flaws in repOKs employed in the experimental assessment of related tools, thus providing evidence on the difficulty of writing repOKs for BEG.

### 2 A Motivating Example

To illustrate the difficulties of writing formal specifications for BEG, consider Apache's NodeCachingLinkedList's (NCL) representation invariant shown in Figure 1 (taken from the ROOPS benchmark<sup>5</sup> ). NCLs are composed of a main circular, doubly-linked list, used for data storage, and a cache of previously used nodes implemented as a singly linked list. Nodes removed from the main list are moved to the cache, where they are saved for future usage. When a node is required for an insertion operation, a cache node (if one exists) is reused (instead of allocating a new node). As usual, repOK returns true iff the input structure satisfies the intended NCL properties [17]. Lines 1 to 20 check that the main list is a circular doubly-linked list with a dummy head; lines 21 to 33 check that the cache is a null terminated singly linked list (and the consistency of size fields is verified in the process). This repOK is written in the way recommended by the authors of Korat [4]. It returns false as soon as it finds a violation of an intended property in the current input. Otherwise, it returns true at the end. This allows Korat to prune large portions of the search space, and improves its

<sup>5</sup> https://code.google.com/p/roops/

```
1 public boolean repOK() {
2 if (this.header == null) return false;
3 // Missing constraint: the value of the sentinel node must be null
4 // if (this.header.value != null) return false;
5 if (this.header.next == null) return false;
6 if (this.header.previous == null) return false;
7 if (this.cacheSize > this.maximumCacheSize) return false;
8 if (this.size < 0) return false;
9 int cyclicSize = 0;
10 LinkedListNode n = this.header;
11 do {
12 cyclicSize++;
13 if (n.previous == null) return false;
14 if (n.previous.next != n) return false;
15 if (n.next == null) return false;
16 if (n.next.previous != n) return false;
17 if (n != null) n = n.next;
18 } while (n != this.header && n != null);
19 if (n == null) return false;
20 if (this.size != cyclicSize - 1) return false;
21 int acyclicSize = 0;
22 LinkedListNode m = this.firstCachedNode;
23 Set visited = new HashSet();
24 visited.add(this.firstCachedNode);
25 while (m != null) {
26 acyclicSize++;
27 if (m.previous != null) return false;
28 // Missing constraint: the value of cache nodes must be null
29 // if (m.value != null) return false;
30 m = m.next;
31 if (!visited.add(m)) return false;
32 }
33 if (this.cacheSize != acyclicSize) return false;
34 return true;
35 }
```
Fig. 1. NodeCachingLinkedList's repOK from ROOPS

performance [4]. repOK suffers from underspecification: it does not state that the sentinel node and all cache nodes must have null values (lines 3-4 and 28-29, respectively). Mistakes like these are very common when writing specifications (see Section 4.3), and difficult to discover by manual inspection of repOK. These errors can have serious consequences for BEG. Executing Korat with repOK and a scope of up to 8 nodes produces 54.5 million NCL structures, while the actual number of valid NCL instances is 2.8 million. Clearly, this is a problem for Korat's performance, and for the subsequent testing of the SUT. In addition, the invalid instances generated might trigger false alarms in the SUT in many cases. We discovered these errors in repOK with the help of BEAPI: we automatically contrasted the structures generated using BEAPI and the NCL's API, with those generated using Korat with repOK, for the same scope.

This example shows that writing sound and precise repOKs for BEG is difficult and time consuming. Fine-tuning repOKs to improve the performance of BEG (e.g., for Korat) is even harder. The main advantage of BEAPI is that it requires minimal specification effort to perform BEG. If API methods used for generation are correct, all generated structures are valid by construction. The programmer only needs to make sure that API methods throw exceptions when API usage

```
1 max.objects=3
2 int.range=0:2
3 # strings=str1,str2,str3
4 # omit.fields=NodeCachingLinkedList.DEFAULT_MAXIMUM_CACHE_SIZE
```
Fig. 2. BEAPI's scope definition for NCL (max. nodes 3)

rules are violated, in a defensive programming style [17]. In most cases, this requires checking very simple conditions on the inputs. In our example, the method that adds an element to an NCL throws an IllegalArgumentException when it is called with a null element (the implementation of the method takes care that the remaining NCL properties hold).

### 3 Bounded Exhaustive Generation from Program APIs

We now describe BEAPI's approach. We start with the definition of scope, then present BEAPI's optimizations, and we finally describe BEAPI's algorithm.

### 3.1 Scope Definition

The definition of scope in Korat involves providing bounded data domains for classes and fields of the SUT, since Korat explores the state space of feasible input candidates, and yields the set of inputs satisfying repOK as a result. Instead, BEAPI explores the search space of (bounded) test sequences that can be formed by making calls to the SUT's API. Thus, we have to provide data domains for the primitive types employed to make such calls, and a bound on the maximum size of the structures we want to keep, from those generated by such API calls. An example configuration file defining BEAPI's scope for the NCL case study is shown in Figure 2. The max.objects parameter specifies the maximum number of different objects (reachable from the root) that a structure is allowed to have. Test sequences that create a structure with a larger number of different objects (of any class) than max.objects will be discarded (and the structure too). In our example, this implies that BEAPI will not create NCLs with more than 3 nodes. Next, one has to specify the values that will be employed by BEAPI to invoke API routines that take primitive type parameters (e.g., elements to insert into the list). The int.range parameter allows one to specify a range of integers, which goes from 0 to 2 in Figure 2. One may also specify domains for other primitive types like floats, doubles and strings, by describing their values by extension. For example, line 3 shows how to define str1, str2 and str3 as the feasible values for String-typed parameters. Also, we can instruct BEAPI which fields to take into account for structure canonicalization, or which fields to omit (omit.fields). This allows the user to control the state matching process (see Section 3.2). For example, uncommenting line 4 would make BEAPI omit the DEFAULT\_MAXIMUM\_CACHE\_SIZE in state matching, which in our example is a constant initialized to 20 in the class constructor. 
In this case, omitting the field does not change the set of structures generated by BEAPI, but in other cases omitting fields may have an impact. The configuration in Figure 2 is enough for BEAPI to generate NCLs with a maximum of 3 nodes, containing integers from 0 to 2 as values, which allowed us to mimic the structures generated by Korat for the same scope.

### 3.2 State Matching

In test generation with BEAPI, multiple test sequences often produce the same structure, e.g., inserting an element into a list and removing it afterwards. BEAPI assumes that method executions are deterministic: any execution of a method with the same inputs yields the same results. For the generation of a bounded exhaustive set of structures, for each distinct structure s in the set, BEAPI only needs to save the first test sequence that generates s. All test sequences generated subsequently that also create s can be discarded. As BEAPI works by extending previously generated test sequences (Section 3.4), saving many test sequences for the same structure would mean extending all of them with new routines in subsequent iterations, resulting in unnecessary computation. Hence, we implement state matching in BEAPI as follows. We store all the structures produced so far by BEAPI in a canonical form (see below). After executing the last routine r(p1,..,pk) of a newly generated test sequence T, we check whether any of r's parameters holds a structure not seen before (i.e., not stored). If T does not create any new structure, it is discarded. Otherwise, T and the new structures it generates are stored by BEAPI.
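The keep-or-discard decision can be sketched as follows. The `canonicalize` function stands in for the linearization described below; here a simple `repr`-based stand-in is used, and all names are illustrative.

```python
# Sketch of BEAPI's state-matching step: a sequence is kept only if it
# creates at least one structure not seen before. `canonicalize` is a
# stand-in for the linearization of Section 3.2.
seen = set()          # canonical forms of all structures produced so far
kept_sequences = []   # first sequence found for each distinct structure

def canonicalize(structure):
    return repr(structure)  # stand-in for linearization

def process(sequence, produced_structures):
    """Keep `sequence` only if it creates an unseen structure."""
    new = [canonicalize(s) for s in produced_structures
           if canonicalize(s) not in seen]
    if not new:
        return False          # discard: nothing new was created
    seen.update(new)
    kept_sequences.append(sequence)
    return True
```

A real implementation would canonicalize each structure once and use the linearized sequences as set keys, but the pruning logic is the same.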

We represent heap-allocated structures as labeled graphs. After the execution of a method, a (non-primitive typed) parameter p holds a reference to the root object r of a rooted heap (i.e. p = r), defined below.

Definition 1. Let O be a set of objects, and P a set of primitive values (including null). Let F be the fields of all objects in O.


The special case p = null can be represented by a rooted heap with a dummy node and a dummy field pointing to null. In languages without explicit memory management (like Java), each object is identified by the memory address where it is allocated. However, changing the memory addresses of objects (while keeping the same graph structure) has no effect on the execution of a program. Heaps obtained by permutations of the memory addresses of their component objects are called isomorphic heaps. We avoid the generation of isomorphic heaps by employing a canonical representation for heaps [15,4]. Rooted heaps can be efficiently canonicalized by an approach called linearization [15,36], which transforms a rooted heap into a unique sequence of values.

Figure 3 shows the linearization algorithm used by BEAPI, a customized version that reports when objects exceed the scopes and supports ignoring object

```
1 int[] linearize(O root, Heap<O, E> heap, int scope, Regex omitFields) {
2 Map ids = new Map(); // maps nodes into their unique ids
3 return lin(root, heap, scope, ids, omitFields);
4 }
5 int[] lin(O root, Heap<O, E> heap, int scope, Map ids, Regex omitFields) {
6 if (ids.containsKey(root))
7 return singletonSequence(ids.get(root));
8 if (ids.size() == scope)
9 throw new ScopeExceededException();
10 int id = ids.size() + 1;
11 ids.put(root, id);
12 int[] seq = singletonSequence(id);
13 Edge[] fields = sortByField({ <root, f, o> in E }, omitFields);
14 foreach (<root, f, o> in fields) {
15 if (isPrimitive(o))
16 seq.add(uniqueRepresentation(o));
17 else
18 seq.append(lin(o, heap, scope, ids, omitFields));
19 }
20 return seq;
21 }
```
Fig. 3. Linearization algorithm

fields (for the original version see [36]). linearize starts a depth-first traversal of the heap from the root, by invoking lin in line 3. To canonicalize the heap, lin assigns different identifiers to the different objects it visits. The map ids stores the mapping between objects and unique object identifiers. When an object is visited for the first time, it is assigned a new unique identifier (lines 10-11), and a singleton sequence with the identifier is created to represent the object (line 12). Then, the object's fields, sorted in a predefined order (e.g., by name), are traversed, the linearization of each field value is constructed, and the result is appended to the sequence representing the current object (lines 13-19). A field storing a primitive value is represented by a singleton sequence with the primitive value (lines 15-16). If a field references an object, a recursive call to lin converts the object into a sequence, which is appended to the result (line 18). At the end of the loop, seq contains the canonical representation of the whole rooted heap starting at root, and is returned by lin (line 20). When an already visited object is traversed by a recursive call, the object must have an identifier already assigned in ids (line 6), and lin returns the singleton sequence with the object's unique identifier (line 7). When more than scope objects are reachable from the rooted heap, lin throws an exception to report that the scope has been exceeded (lines 8-9). The exception is employed later on by BEAPI to discard test sequences that create objects larger than allowed by the scope. linearize also takes as a parameter a regular expression omitFields, which matches the names of the fields that must be omitted during canonicalization (see Section 3.1). To omit such fields, we implemented sortByField (line 13) in such a way that it does not return the edges corresponding to fields whose names match omitFields.
This in turn avoids saving the values of omitted fields in the sequence yielded by linearize. Finally, notice that linearization allows for efficient comparison of objects (rooted heaps): two objects are equal if and only if their corresponding sequences yielded by linearize are equal.
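A minimal Python transcription of the linearization idea, showing that isomorphic heaps map to the same sequence, might look as follows. Scope checking and field omission are left out for brevity, and the tagged-tuple encoding is an assumption, not the exact encoding of Figure 3.

```python
# Simplified sketch of linearization: depth-first traversal assigning
# fresh ids on first visit, back-references for revisits, primitives
# recorded verbatim. Isomorphic heaps yield identical sequences.
def linearize(root):
    ids, seq = {}, []

    def lin(obj):
        if obj is None or isinstance(obj, (int, str)):
            seq.append(("prim", obj))           # primitive value, verbatim
            return
        if id(obj) in ids:
            seq.append(("ref", ids[id(obj)]))   # back-reference to known id
            return
        ids[id(obj)] = len(ids) + 1             # first visit: fresh identifier
        seq.append(("obj", ids[id(obj)]))
        for field in sorted(vars(obj)):         # fields in a fixed order
            lin(getattr(obj, field))

    lin(root)
    return tuple(seq)
```

Because the sequence records only identifiers and field order, two heaps differing solely in object memory addresses canonicalize identically, so comparing the resulting tuples compares rooted heaps.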

### 3.3 Builders Identification Approach

As the feasible combinations of methods grow exponentially with the number of methods, it is crucial to reduce the number of methods that BEAPI uses to produce test sequences. We employ an automated builders identification approach [27] to find a subset of API methods that is sufficient for the generation of the bounded exhaustive structure sets. We call such routines builders. The previous approach to identifying a sufficient subset of builders from an API is based on a genetic algorithm, but it is computationally expensive [27]. Here, we consider a simpler hill-climbing approach (HC) that achieves better performance. HC may of course be less precise, as it may include in the resulting set of builders some methods that are not needed to produce a bounded exhaustive set of structures. However, HC worked very well and consistently computed minimal sets of builders in our experiments (we checked that the set of builders computed by HC matched the set we manually identified for each case study). Our goal here is to assess the impact of using builders for BEG from an API; comparing the HC approach against existing techniques is left for future work.

Let API = {m1, m2, ..., mn} be the set of API methods. HC explores the search space of all subsets of methods from API. HC requires the user to provide a scope s (in the same way as BEAPI). The fitness f(sm) of a given set sm of methods is the number of distinct structures (after canonicalization) that BEAPI generates using the set, for the given scope s. We also give priority in the fitness to sets of methods with fewer and simpler parameter types (see [27] for further details). The successors succs(sm) of a candidate sm are the sets sm ∪ {mi}, for each mi ∈ API. HC starts by computing the fitness of all singletons {c} of constructor methods. The best of the singletons is set as the current candidate curr, and HC starts a typical iterative hill-climbing process. At each iteration, HC computes f(succ) for each succ ∈ succs(curr). Let best be the successor with the highest fitness value. Notice that best has exactly one more method than the best candidate of the previous iteration, curr. If f(best) > f(curr), the methods in best can be used to create a larger set of structures than those in curr. Thus, HC assigns best to curr and continues with the next iteration. Otherwise, f(best) ≤ f(curr), and curr already generates the largest possible set of structures (no method can be added that increases the number of generated structures from curr). At this point, curr is returned as the set of identified builders.
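The HC loop can be sketched as follows. The fitness function, which in BEAPI counts the distinct structures generated for the given scope, is abstracted here as a parameter; method names in the usage example are illustrative.

```python
# Sketch of the hill-climbing builders identification loop. `fitness(ms)`
# stands for "number of distinct structures BEAPI generates using the
# methods in ms for the given scope"; it is abstracted as a parameter.
def identify_builders(constructors, api, fitness):
    # start from the best singleton set of constructor methods
    curr = max(({c} for c in constructors), key=fitness)
    while True:
        successors = [curr | {m} for m in api if m not in curr]
        if not successors:
            return curr
        best = max(successors, key=fitness)
        if fitness(best) <= fitness(curr):
            return curr   # no method increases the generated set: done
        curr = best
```

Each iteration adds exactly one method, so the loop terminates after at most |API| iterations, returning a (locally) sufficient builder set.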

Notice that HC performs many invocations of BEAPI during builders identification. The key insight that makes this feasible is that the builders identified for a relatively small scope are often exactly the methods needed to create structures of any size. In other words, once the scope for builders computation is large enough, increasing it further yields the same set of builders. This resembles the small scope hypothesis for bug detection [3] (and transcoping [31]). A scope of 5 was enough for builders computation in all our case studies (we manually checked that the computed builders were the right ones in all cases). After builders are identified efficiently using a small scope, we can run BEAPI with the identified builders using a larger scope, for example, to generate bigger objects to exercise the SUT. In most of our case studies, builders comprise a constructor and a single method to add elements to the structure. However, our automated builders identification approach showed that, for Red-Black Trees, a remove method was also required (for scopes greater than 3), since there are trees with a particular balance configuration (red and black coloring of the nodes) that cannot be constructed by just adding elements to the tree. In contrast, AVL trees, which are also balanced, do not require the remove method as a builder: the class constructor and an add routine suffice. This shows that builders identification is non-trivial to perform manually, as it requires a very careful exploration of a very large number of structures and method combinations. Other structures that require more than two builders are binomial and Fibonacci heaps.

```
1 BEAPI(List methods, int scope, Map<Type, List<Seq>> primitives, Regex omitFields) {
2   Map<Type, List<Seq>> currSeqs = new Map();
3   currSeqs.addAll({ T->L | T->L in primitives });
4   Set canonicalStrs = new Set();
5   for (int it=0; true; it++) {
6     Map<Type, List<Seq>> newSeqs = new Map();
7     boolean newStrs = false;
8     for (m(T1,...,Tn):Tr : methods) {
9       List<Seq> seqsT1 = currSeqs.getSequencesForType(T1);
10      ...
11      List<Seq> seqsTn = currSeqs.getSequencesForType(Tn);
12      for ((s1,...,sn) : seqsT1 × ... × seqsTn) {
13        Seq newSeq = createNewSeq(s1,...,sn,m);
14        o1,...,on,or,failure,exception = execute(newSeq);
15        if (failure) throw new ExecutionFailedException(newSeq);
16        if (exception) continue;
17        c1,...,cn,cr,outOfScope = makeCanonical(o1,...,on,or,scope,omitFields);
18        if (outOfScope) continue;
19        if (isReferenceType(T1) and !canonicalStrs.contains(c1)) {
20          canonicalStrs.add(c1);
21          newSeqs.addSeqForType(T1, newSeq);
22          newStrs = true;
23        }
24        ...
25        if (isReferenceType(Tr) and !canonicalStrs.contains(cr)) {
26          canonicalStrs.add(cr);
27          newSeqs.addSeqForType(Tr, newSeq);
28          newStrs = true;
29        }
30      }
31    }
32    if (!newStrs) break;
33    currSeqs.addAll(newSeqs);
34  }
35  return currSeqs.getAllSeqsAsList();
36 }
```
Fig. 4. BEAPI algorithm

#### 3.4 The BEAPI Approach

A pseudocode of BEAPI is shown in Figure 4. BEAPI takes as inputs a list of methods from an API, methods (the whole API, or previously identified builders); the scope for generation, scope; a list of test sequences to create values for each primitive type provided in the scope description, primitives (automatically created from configuration options int.range, strings, etc., see Fig. 2); and a regular expression matching fields to be omitted in the canonicalization of structures, omitFields. Notice that methods from more than one class could be passed in methods if one wants to generate objects for several classes in the same execution of BEAPI, e.g., when methods from one class take objects from another class as parameters. BEAPI's map currSeqs stores, for each type, the list of test sequences that are known to generate structures of the type. currSeqs starts with all the primitive typed sequences in primitives (lines 2-3). At each iteration of the main loop (lines 5-34), BEAPI creates new sequences for each available method m (line 8), by exhaustively exploring all the possibilities for creating test sequences using m and inputs generated in previous iterations and stored in currSeqs (lines 9-30). The newly created test sequences that generate new structures in the current iteration are saved in map newSeqs (initialized empty in line 6); all the generated sequences are then added to currSeqs at the end of the iteration (line 33). If no new structures are produced at the current iteration (newStrs is false in line 32), BEAPI's main loop terminates and the list of all sequences in currSeqs is returned (line 35).

Let us now discuss the details of the for loop in lines 9-30. First, all sequences that can be used to construct inputs for m are retrieved in seqsT1,...,seqsTn. BEAPI explores each tuple (s1,...,sn) of feasible inputs for m. Then, it executes createNewSeq (line 13), which constructs a new test sequence newSeq by sequentially composing test sequences s1,...,sn and routine m, and replacing m's formal parameters with the variables that create the required objects in s1,...,sn. newSeq is then executed (line 14), and it either produces a failure (failure is set to true), raises an exception that represents an invalid usage of the API (exception is set to true), or executes successfully and creates new objects o1,...,on,or. In case of a failure, an exception is thrown and newSeq is presented to the user as a witness of the failure (line 15). If a different kind of exception is thrown, BEAPI assumes it corresponds to an API misuse (see below), discards the test sequence (line 16), and continues with the next candidate sequence. Otherwise, the execution of newSeq builds new objects o1,...,on,or (or values of primitive types), which are canonicalized by makeCanonical (line 17) by executing linearize from Figure 3 on each structure. If any of the structures produced by newSeq exceeds the scope, makeCanonical sets outOfScope to true, and BEAPI discards newSeq and continues with the next one (line 18). If none of the above happens, makeCanonical returns canonical versions of o1,...,on,or in variables c1,...,cn,cr, respectively. Afterwards, BEAPI performs state matching by checking whether the canonical structure c1 is of reference type and has not been created by any previous test sequence (line 19); canonicalStrs stores all of the already visited structures.
If c1 is a new structure, it is added to canonicalStrs (line 20), and the sequence that creates it, newSeq, is added to the set of test sequences producing structures of type T1 (newSeqs, line 21). Also, newStrs is set to true to indicate that at least one new object has been created in the current iteration (line 22). This process is repeated for the canonical objects c2,...,cn,cr (lines 24-29).
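The canonicalization-plus-state-matching step can be sketched as follows. The `canonicalize` method below is a naive stand-in for the linearization of Figure 3 (it simply walks a singly linked list), and all class and method names are ours.

```java
import java.util.*;

// State matching sketch: a newly built structure is kept only if its
// canonical string has not been seen before. Two lists with the same
// elements in the same order canonicalize identically, regardless of
// the test sequence that built them.
public class StateMatching {

    record Node(int value, Node next) {}

    // Naive linearization: record the values in traversal order.
    static String canonicalize(Node head) {
        List<String> vals = new ArrayList<>();
        for (Node n = head; n != null; n = n.next()) vals.add(Integer.toString(n.value()));
        return String.join(",", vals);
    }

    public static void main(String[] args) {
        Set<String> canonicalStrs = new HashSet<>();
        Node a = new Node(1, new Node(2, null)); // built by one sequence
        Node b = new Node(1, new Node(2, null)); // same structure, another sequence
        Node c = new Node(2, null);              // genuinely new structure
        System.out.println(canonicalStrs.add(canonicalize(a))); // true: kept
        System.out.println(canonicalStrs.add(canonicalize(b))); // false: duplicate, pruned
        System.out.println(canonicalStrs.add(canonicalize(c))); // true: kept
    }
}
```

Pruning the duplicate sequence for b is precisely what keeps BEAPI's search space from blowing up across iterations.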

BEAPI distinguishes failures from bad API usage based on the type of the exception (similarly to previous API based test generation techniques [23]). For example, IllegalArgumentException and IllegalStateException correspond to API misuses, and the remaining exceptions are considered failures by default. BEAPI's implementation allows the user to select which exceptions correspond to failures and which do not, by setting the corresponding configuration parameters. As mentioned in Section 2, BEAPI assumes that API methods throw exceptions when they fail to execute on invalid inputs. We argue that this is a common practice, called defensive programming [17], that should be followed by all programmers, as it results in more robust code and improves software testing in general [2] (besides helping automated test generation tools). We also argued in Section 2 that the specification effort required for defensive programming is much smaller than that of writing precise (and efficient) repOKs for BEG, which we confirmed by manually inspecting the source code of our case studies. On the other hand, note that BEAPI can employ formal specifications to reveal bugs in the API, e.g., by executing repOK and checking that it returns true on every generated object of the corresponding type (as in Randoop [23]). However, specifications used for bug finding do not need to be very precise (e.g., the underspecified NCL repOK from Section 2 is fine for bug finding), nor written in a particular way (as required by Korat). Other kinds of specifications that are weaker and simpler to write can also be used by BEAPI to reveal bugs, such as violations of language specific contracts (e.g., equals is an equivalence relation in Java), metamorphic properties [7], user-provided assertions (assert), etc.
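The default classification can be sketched as follows. The two misuse exception types are the defaults named above; the class and method names are ours, and in the tool the misuse set is configurable.

```java
import java.util.Set;

// Sketch of BEAPI's default exception classification: misuse exceptions
// cause the candidate sequence to be silently discarded; any other
// exception is reported to the user as a candidate failure.
public class ExceptionPolicy {

    static final Set<Class<? extends RuntimeException>> MISUSE =
        Set.of(IllegalArgumentException.class, IllegalStateException.class);

    static boolean isMisuse(Throwable t) {
        return MISUSE.contains(t.getClass());
    }

    public static void main(String[] args) {
        System.out.println(isMisuse(new IllegalArgumentException("bad input"))); // true: discard
        System.out.println(isMisuse(new NullPointerException()));                // false: failure
    }
}
```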

Another advantage of BEAPI is that, for each generated object, it yields a test sequence that can be executed to create the object. This is in contrast with specification based approaches (that generate a set of objects from repOK). Finding a sequence of invocations to API methods that create a specific structure is a difficult problem on its own, that can be rather costly computationally [5], or require significant effort to perform manually. Thus, often objects generated by specification based approaches are "hardwired" when used for testing a SUT (e.g., by using Java reflection), making tests very hard to understand and maintain, as they depend on the low-level implementation details of the structures [5].

### 4 Evaluation

In this section, we experimentally assess BEAPI against related approaches. The evaluation is organized around the following research questions:

- RQ1: Is bounded exhaustive generation from APIs efficient enough to be useful in practice?
- RQ2: What is the impact of BEAPI's optimizations on generation performance?
- RQ3: Can BEAPI assist developers in analyzing repOK specifications?
As case studies, we employ data structure implementations from four benchmarks: three employed in the assessment of existing testing tools (Korat [4], Kiasan [9], FAJITA [1]), and ROOPS. These benchmarks cover diverse implementations of complex data structures, which are a good target for BEG. We chose them as case studies because the implementations come equipped with repOKs, written by the authors of the benchmarks. The experiments were run on a workstation with an Intel Core i7-8700 CPU (3.2 GHz) and 16 GB of RAM. We set a timeout of 60 minutes for each individual run. To replicate the experiments, we refer the reader to the paper's artifact [25].

### 4.1 RQ1: Efficiency of Bounded Exhaustive Generation from APIs

For RQ1 we assess whether BEAPI is fast enough to be a useful BEG approach, by comparing it to the fastest existing BEG approach, Korat [32]. The results of the comparison are summarized in Table 1. For each technique, we report generation times (in seconds) and the numbers of generated and explored structures, for increasingly large scopes. Due to space reasons, we show a representative sample of the results (we try to maintain the same proportion of good and bad cases for each technique in the data we report). We include the largest successful scope for each technique; the execution times for the largest scopes are in boldface in the table, so that scalability issues, should they arise, can be easily identified. For the complete report of the results, visit the paper's website [26]. To obtain proper performance results for BEAPI, we extensively tested the API methods of the classes to ensure they were correct for this experiment. We did not change the repOKs in any way, because that would change Korat's performance, and one of our goals here is to evaluate Korat's performance with repOKs written by different programmers.

Differences in the numbers of explored structures are expected, since the corresponding search spaces of Korat and BEAPI are different. However, for the same case study and scope, one would expect both approaches to generate the same number of valid structures. This is indeed the case in most experiments, with two notable kinds of exceptions. First, there are cases where repOK has errors; these cases are grayed out in the tables. Second, the slightly different notion of scope in each technique can cause discrepancies. This only happens for Red-Black Trees (RBT) and Fibonacci heaps (FibHeap), shown in boldface: in these cases certain structures of size n can only be generated from larger structures, with insertions followed by removals and then insertions again to trigger specific balance rearrangements. BEAPI discards generated sequences as soon as they exceed the maximum structure size, hence it cannot generate these structures.

In terms of performance, we have mixed results. In the Korat benchmark, Korat shows better performance in 4 out of 6 cases. In the FAJITA benchmark, BEAPI is better in 3 out of 4 cases. In the ROOPS benchmark, BEAPI is better in 5 out of 7 cases. In the Kiasan benchmark, Korat is faster in 6 of the 7 cases. We observe that BEAPI performs better on structures with more restrictive constraints, such as RBT and Binary Search Trees (BST); these cases often have a smaller number of valid structures. Cases where the number of valid structures grows faster with respect to the scope, such as doubly-linked lists (DLList), are better suited for Korat: more structures mean that BEAPI has to create more test sequences in each successive iteration, which hurts its performance.

Table 1. Efficiency assessment of BEAPI against Korat

As expected, the way repOKs are written has a significant impact on Korat's performance. For example, for binomial heaps (BinHeap) Korat reaches scope 8 with ROOPS' repOK, scope 10 with FAJITA's repOK, and scope 11 with Korat's repOK (all equivalent in terms of generated structures). In most cases, repOKs from the Korat benchmark result in better performance, as these are fine-tuned for use with Korat. Case studies with errors in repOKs are grayed out in the table and discussed further in Section 4.3. Notice that errors in repOKs can severely affect Korat's performance.

### 4.2 RQ2: Impact of BEAPI's Optimizations


Table 2. Execution times (sec) of BEAPI under different configurations.


In RQ2 we assess the impact each of BEAPI's proposed optimizations has on BEG. For this, we assess the performance of four different BEAPI configurations: SM/BLD is BEAPI with both state matching (SM) and builders identification (BLD) enabled; SM is BEAPI with only state matching enabled; BLD is BEAPI with only builders identification enabled; NoOPT has both optimizations disabled. The left part of Table 2 summarizes the results of this experiment for the ROOPS benchmark; the right part reports preliminary results on five "real world" implementations of data structures: LinkedList (21 API methods), TreeSet (22 API methods), TreeMap (32 methods), and HashMap (29 methods) from java.util, and NCL from Apache Collections (20 methods). Like most real world implementations, these data structures do not come equipped with repOKs; hence, we only employ them in this RQ.

The brute force approach (NoOPT) performs poorly even for the easiest case studies and very small scopes; such scopes are often too small to generate high quality test suites. State matching is the most impactful optimization, by itself greatly improving performance and scalability across the board (compare the NoOPT and SM results). As expected, builders identification matters most when the API has many methods (more than 10), and most remarkably in the real world data structures (20 or more API methods). SM/BLD is more than an order of magnitude faster than SM in AVL and RBT, and it reaches one more scope in NCL and LList. The remaining classes of ROOPS have just a few methods, and the impact of using builders is relatively small. The conclusions drawn from ROOPS apply to the other three benchmarks (we omit their results here for space reasons; visit the paper's website for a complete report [26]). In the real world data structures, using precomputed builders allowed SM/BLD to scale to significantly larger scopes in all cases but TreeMap and TreeSet, where it significantly improves running times. Overall, the proposed optimizations have a crucial impact on BEAPI's performance and scalability, and both should be enabled to obtain good results.

On the cost of builders identification. Due to space reasons, we report builders identification times on the paper's website [26]. For the conclusions of this section, it is sufficient to say that scope 5 was employed for builders identification in all cases, and that the maximum runtime of the approach was 65 seconds in the four benchmarks (ROOPS' SLL, 11 methods) and 132 seconds in the real world data structures (TreeMap, 32 methods). We manually checked that the identified methods included a set of sufficient builders in all cases. Notice that BEG is often performed for increasingly larger scopes, and the identified builders can be reused across executions; builders identification times are thus amortized across executions, which makes it difficult to apportion them to individual runs. Hence, we did not include builders identification times in BEAPI's running times in any of the experiments. Notice that, for the larger scopes, which arguably are the most important, builders identification time is negligible in relation to generation times.

### 4.3 RQ3: Analysis of Specifications using BEAPI

RQ3 addresses whether BEAPI can assist the user in finding flaws in repOKs, by comparing the set of objects that can be generated using the API with the set of objects generated from the repOK. We devised the following automated procedure. First, we run BEAPI to generate a set SA of structures from the API, and Korat to generate a set SR from repOK, using the same scope for both tools. Second, we canonicalize the structures in both SA and SR using linearization (Section 3.2). Third, we compare sets SA and SR for equality. Differences in this comparison point out a mismatch between repOK and the API. There are three possible outcomes of this automated procedure. If SA ⊂ SR, it is possible that the API generates a subset of the valid structures, that repOK suffers from underspecification (missing constraints), or both. In this case, the structures in SR that do not belong to SA are witnesses of the problem, and the user has to analyze them manually to find out where the error is. Here, we report the (manually confirmed) underspecification errors in repOKs that are witnessed by such structures. Conversely, when SR ⊂ SA, it can be the case that the API generates a superset of the valid structures, that repOK suffers from overspecification (repOK is too strong), or both. The structures in SA that do not belong to SR might point to the root of the error, and again they have to be analyzed manually by the user. We report the (manually confirmed) overspecification errors in repOKs that are witnessed by these structures. Finally, it can be the case that there are structures in SR that do not belong to SA, and (distinct) structures in SA that do not belong to SR. These might be due to faults in the API, flaws in the repOK, or both. We report the manually confirmed flaws in repOKs witnessed by such structures simply as errors (repOK describes a different set of structures than the one it should). Notice that differences in the scope definitions of the two approaches might make sets SA and SR differ. This was only the case for the RBT and FibHeap structures, where BEAPI generated a smaller set of structures than Korat for the same scope, due to balance constraints (as explained in Section 4.1). However, these "false positives" can be easily revealed, since all the structures generated by Korat were always included in the structures generated by BEAPI when a larger scope was used for the latter. Using this insight, we manually discarded the "false positives" due to scope differences in RBT and FibHeap.

Table 3. Summary of flaws found in repOKs using BEAPI

The results of this experiment are summarized in Table 3. We found flaws in 9 out of 26 repOKs using the approach described above. The high number of flaws discovered shows that problems in repOKs are hard to find manually, and that BEAPI can be of great help for this task.
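The comparison step of the procedure above can be sketched over sets of canonical strings; the classification messages and the `classify` name are ours.

```java
import java.util.*;

// Sketch of the repOK-vs-API comparison. sa holds canonicalized
// structures generated from the API (BEAPI), sr those generated from
// repOK (Korat); the set differences classify the mismatch.
public class SpecCheck {

    static String classify(Set<String> sa, Set<String> sr) {
        Set<String> onlyR = new TreeSet<>(sr); onlyR.removeAll(sa); // in SR, not in SA
        Set<String> onlyA = new TreeSet<>(sa); onlyA.removeAll(sr); // in SA, not in SR
        if (onlyR.isEmpty() && onlyA.isEmpty()) return "match";
        if (onlyA.isEmpty()) return "SA c SR: possible underspecification, witnesses " + onlyR;
        if (onlyR.isEmpty()) return "SR c SA: possible overspecification, witnesses " + onlyA;
        return "error: repOK and API disagree in both directions";
    }

    public static void main(String[] args) {
        Set<String> sa = Set.of("s1", "s2");
        Set<String> sr = Set.of("s1", "s2", "s3");
        System.out.println(classify(sa, sr));
        // prints SA c SR: possible underspecification, witnesses [s3]
    }
}
```

The witness structures in each branch are exactly the ones the user inspects manually to locate the flaw (or, for RBT and FibHeap, to discard scope-induced false positives).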

### 5 Related Work

BEG approaches have been shown effective in achieving high code coverage and finding faults, as reported in various research papers [20,16,4,33]. Our goal here is not to assess yet again the effectiveness of BEG suites, but to introduce an approach that is straightforward to use in today's software because it does not require the manual work of writing formal specifications of the properties of the inputs (e.g., repOKs).

Different languages have been proposed to formally describe structural constraints for BEG, including Alloy's relational logic (the so-called declarative style), employed by the TestEra tool [20], and source code in an imperative programming language (the so-called operational style), as used by Korat [4]. The declarative style has the advantage of being more concise and simpler for people familiar with it; however, such familiarity is not common among developers. The operational style can be more verbose, but since specifications and source code are written in the same language, it is usually the style preferred by developers. UDITA [11] and HyTeK [29] propose to employ a mix of the operational and declarative styles, as parts of the constraints are often easier to write in one style or the other. With precise specifications, both approaches can be used for BEG. Still, to use them developers have to be familiar with both specification styles, and take the time and effort required to write the specifications.

Model checkers like Java PathFinder [34] (JPF) can also perform BEG, but the user has to manually provide a "driver" for the generation: a program that the model checker can use to generate the structures that will be fed to the SUT afterwards. Writing a BEG driver often involves invoking API routines in combination with JPF's nondeterministic operators, hence the developer must become familiar with these operators and put in some manual effort to use this approach. Furthermore, JPF runs on a customized virtual machine in place of Java's standard JVM, so there is a significant overhead in running JPF compared to using the standard JVM (employed by BEAPI). The results of a previous study [32] show that JPF is significantly slower than Korat for BEG; therein, Korat was shown to be the fastest and most scalable BEG approach at the time of publication. This can be explained in part by its smart pruning of the search space of invalid structures and its elimination of isomorphic structures. In contrast, BEAPI does not require a repOK and works by making calls to the API.

An alternative kind of BEG consists of generating all inputs that cover all feasible (bounded) program paths, instead of all feasible bounded inputs. This is the approach of systematic dynamic test generation, a variant of symbolic execution [14]. It is implemented by many tools [13,12,24,8], and has been successfully used to produce test suites with high code coverage, to reveal real program faults, and to prove memory safety of programs. Kiasan [9] and FAJITA [1] are also white-box test case generation approaches that require formal specifications and aim for coverage of the SUT.

Linearization has been employed to eliminate isomorphic structures in traditional model checkers [15,28], and also in software model checkers [35]. A previous study experimented with state matching in JPF and proposed several approaches for pruning the search space for program inputs using linearization, for both concrete and symbolic execution [35]. As stated before, concrete execution in JPF requires the user to provide a driver. The symbolic approach attempts to find inputs to cover paths of the SUT; we perform BEG instead. Linearization has also been employed for test suite minimization [36].

### 6 Conclusions

Software quality assurance can be greatly improved thanks to modern software analysis techniques, among which automated test generation techniques play an outstanding role [6,18,10,23,19,12,20,4,13]. Random and search-based approaches have shown great success in automatically generating test suites with very good coverage and mutation metrics, but their random nature does not allow these techniques to precisely characterize the families of software behaviors that the generated tests cover. Systematic techniques, such as those based on model checking, symbolic execution or bounded exhaustive generation, cover a precise set of behaviors, and thus can provide specific correctness guarantees.

In this paper, we presented BEAPI, a technique that aims at facilitating the application of a systematic technique, bounded exhaustive input generation, by producing structures solely from a component's API, without the need for a formal specification of the properties of the structures. BEAPI can generate bounded exhaustive suites from components with implicit invariants, and reduces the burden of providing formal specifications and of tailoring the specifications for improved generation. Thanks to a number of optimizations, including an automated identification of builder routines and a canonicalization/state matching mechanism, BEAPI can generate bounded exhaustive suites with a performance comparable to that of the fastest specification-based technique, Korat [4]. We have also identified the characteristics of a component that may make it more suitable for specification-based generation or for API-based generation.

Finally, we have shown how specification based approaches and BEAPI can complement each other, depicting how BEAPI can be used to assess repOK implementations. Using this approach, we found a number of subtle errors in repOK specifications taken from the literature. Thus, techniques that require repOK specifications (e.g., [30]), as well as techniques that require bounded exhaustive suites (e.g., [21]), can benefit from our generation technique.

Acknowledgements This work was partially supported by ANPCyT PICTs 2017-2622, 2019-2050, 2020-2896, an Amazon Research Award, and by EU's Marie Sklodowska-Curie grant No. 101008233 (MISSION). Facundo Molina's work is also supported by Microsoft Research, through a LA PhD Award.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Feature-Guided Analysis of Neural Networks

Divya Gopinath<sup>1</sup>, Luca Lungeanu<sup>4</sup>, Ravi Mangal<sup>2</sup>, Corina Păsăreanu<sup>1,2</sup>, Siqi Xie<sup>3</sup>, and Huanfeng Yu<sup>3</sup>

<sup>1</sup> KBR, NASA Ames, Moffett Field CA 94035, USA divya.gopinath@nasa.gov

<sup>2</sup> Lynbrook High School, San Jose CA 95129, USA

<sup>3</sup> Carnegie Mellon University, Pittsburgh PA 15213, USA

<sup>4</sup> Boeing Research & Technology, Santa Clara CA, USA

Abstract. Applying standard software engineering practices to neural networks is challenging due to the lack of high-level abstractions describing a neural network's behavior. To address this challenge, we propose to extract high-level task-specific features from the neural network internal representation, based on monitoring the neural network activations. The extracted feature representations can serve as a link to high-level requirements and can be leveraged to enable fundamental software engineering activities, such as automated testing, debugging, requirements analysis, and formal verification, leading to better engineering of neural networks. Using two case studies, we present initial empirical evidence demonstrating the feasibility of our ideas.

Keywords: Features, Neural Networks, Software Engineering

### 1 Introduction

The remarkable computational capabilities unlocked by neural networks have led to the emergence of a rapidly growing class of neural-network based software applications. Unlike traditional software applications, whose logic is derived from input-output specifications, neural networks are inherently opaque, as their logic is learned from examples of input-output pairs. The lack of high-level abstractions makes it challenging to interpret the logical reasoning employed by a neural network, and hinders the use of standard software engineering practices such as automated testing, debugging, requirements analysis, and formal verification that have been established for producing high-quality software.

In this work, we aim to address this challenge by proposing a feature-guided approach to neural network engineering. Our proposed approach is illustrated in Figure 1. We draw from the insight that, in a neural network, early layers typically extract the important features of the inputs and the dense layers close to the output contain logic in terms of these features to make decisions [12]. The approach therefore first extracts high-level, human-understandable feature representations from the trained neural network which allows us to formally link domain-specific, human-understandable features to the internal logic of a trained model. This in turn enables us to reason about the model through the lens of the features and to drive the above mentioned software engineering activities.

Fig. 1: Proposed Approach

For feature representations, we seek to extract associations between the activation values at the intermediate layers and higher-level abstractions that have clear semantic meaning (e.g., objects in a scene or weather conditions). We present an algorithm to extract these high-level feature representations in the form of rules (pre ⇒ post), where the precondition (pre) is a box over the latent space at an internal layer and the postcondition (post) denotes the presence (or absence) of the feature.

The formal, checkable rules enable us to evaluate the quality of the datasets, retrieve and label new data, understand scenarios where models make correct and incorrect predictions, detect incorrect (or out-of-distribution) samples at run-time, and verify models against human-understandable requirements.

We evaluate our algorithm for extracting feature representations and the downstream analyses using two networks trained for computer vision tasks, namely TaxiNet [4,9], a regression model for center-line tracking on airport runways, and YOLOv4-Tiny [14], an object detection model trained on the nuImages [6] dataset for autonomous driving.

### 2 Extracting Feature Representations

Algorithm 2.1 describes the method for extracting the representation of a particular feature from a trained neural network. A feed-forward neural network f : R<sup>n</sup> → R<sup>m</sup> is organized in multiple layers, each consisting of computational units called neurons. Each neuron takes a weighted sum of the outputs of the previous layer and applies a non-linear activation function to it. The algorithm requires a small dataset D where each raw input is labeled with 0 or 1, indicating whether the feature under consideration is absent or present. The algorithm takes as inputs a neural network f, the dataset D, and the index l of the layer used for extracting the feature representations. The first step of the algorithm (line 2) is to construct a new dataset A where each raw input x is replaced by the corresponding activation value a output by layer l (f<sub>l</sub>(x) denotes the output of f at layer l for input x). Next, the algorithm invokes a learning procedure to learn a classifier r that separates activation values mapping to the feature being present from those mapping to the feature being absent (line 3).

### Algorithm 2.1: Extracting Feature Representations

Inputs: A neural classifier f : R^n → R^m, a dataset D ⊆ R^n × {0, 1} with |D| = N, and a layer l ∈ {1, . . . , k − 1}, where k is the number of layers in f
Output: Representation r for the feature

```
1 FeatRep(f, D, l):
2     A := {(a, y) | (x, y) ∈ D ∧ a = f_l(x)}    // f_l is the output of f at layer l
3     r := Learn(A)
4     return r
```
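The procedure can be sketched in Python, with scikit-learn's decision-tree learner standing in for Learn; the layer-extraction callable and the toy dataset shape below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def feat_rep(f_l, D, max_depth=5):
    """Sketch of Algorithm 2.1 (FeatRep).

    f_l : callable mapping a raw input x to its activation vector at layer l
    D   : iterable of (x, y) pairs, y in {0, 1} marking feature absent/present
    """
    # Line 2: replace each raw input by its layer-l activation value.
    A = np.array([f_l(x) for (x, y) in D])
    labels = np.array([y for (x, y) in D])
    # Line 3: learn a classifier separating present/absent activations.
    r = DecisionTreeClassifier(max_depth=max_depth).fit(A, labels)
    return r
```

A fitted tree of bounded depth keeps the resulting rules small and human-readable, which is why a decision tree is a natural choice of Learn here.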

We use decision tree learning on line 3 to extract feature representations as a set of rules of the form pre ⇒ {0, 1}; pre in each rule is a condition on neuron values at layer l, and 0 or 1 indicates whether the rule corresponds to the feature being absent or present. pre is a box in the activation space of layer l, i.e., ⋀_{N_j ∈ N_l} (N_j(x) ∈ [v_j^L, v_j^U]). Here N_l is the set of neurons at layer l, and v_j^L and v_j^U are lower and upper bounds for the output of neuron N_j. The rules mined by decision-tree learning partition the activation space at a given inner layer. Some partitions may be impure, containing inputs both with and without the feature. We only select pure rules, which have 100% precision on D, and return them as r. Note that there can be activation values for which no rule in r is satisfied; for these we are unable to say whether the feature is absent or present.
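To make the box semantics concrete, the following sketch enumerates the pure leaves of a fitted scikit-learn tree and reads each root-to-leaf path off as an interval box per neuron. The function name and the encoding of a box as a neuron-index-to-interval map are our own illustrative choices, not the paper's tooling:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pure_box_rules(tree):
    """Enumerate pure leaves of a fitted decision tree as box rules.

    Returns a list of (bounds, label) pairs, where bounds maps a neuron
    index j to an interval (v_j^L, v_j^U) and label is 0 or 1.
    """
    t = tree.tree_
    rules = []

    def walk(node, bounds):
        if t.children_left[node] == -1:           # leaf node
            counts = t.value[node][0]
            if np.count_nonzero(counts) == 1:     # pure: a single class
                rules.append((dict(bounds), int(np.argmax(counts))))
            return
        j, thr = t.feature[node], t.threshold[node]
        lo, hi = bounds.get(j, (-np.inf, np.inf))
        # Left branch: N_j(x) <= thr tightens the upper bound.
        bounds[j] = (lo, min(hi, thr))
        walk(t.children_left[node], bounds)
        # Right branch: N_j(x) > thr tightens the lower bound.
        bounds[j] = (max(lo, thr), hi)
        walk(t.children_right[node], bounds)
        bounds[j] = (lo, hi)                      # restore on backtrack

    walk(0, {})
    return rules
```

Impure leaves are simply skipped, matching the selection of pure rules described above.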

### 3 Feature-Guided Analyses

The extracted feature representations, as formal, checkable rules, enable multiple analyses: evaluating the quality of datasets, retrieving and labeling new data, understanding scenarios where models make correct and incorrect predictions, detecting incorrect (or out-of-distribution) samples at run-time, and verifying models against human-understandable requirements.
One can also check overlap between feature rules, using off-the-shelf decision procedures, to uncover spurious correlations between the different features that are learned by the network. We envision many other applications for these rules, whose exploration we leave for the future.
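For box-shaped preconditions specifically, the overlap check does not require a full decision procedure: two boxes are simultaneously satisfiable exactly when their intervals intersect in every shared dimension. A minimal sketch, using our own encoding of a box as a neuron-index-to-interval map:

```python
def boxes_overlap(b1, b2):
    """Check whether two box preconditions can be satisfied simultaneously.

    Each box maps a neuron index to an interval (lo, hi); a neuron absent
    from a box is unconstrained. Two boxes overlap iff the intervals
    intersect in every dimension constrained by both.
    """
    for j in set(b1) & set(b2):
        lo = max(b1[j][0], b2[j][0])
        hi = min(b1[j][1], b2[j][1])
        if lo > hi:
            return False
    return True
```

An off-the-shelf solver becomes useful once the preconditions go beyond plain boxes (e.g., arbitrary linear constraints over neuron values).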

# 4 Case Studies

We use two case studies to present initial empirical evidence in support of our ideas. In particular, we show that Algorithm 2.1 with decision tree learning is successful in extracting feature representations. We also demonstrate how these representations can be used for analyzing the behavior of neural networks.

### 4.1 Center-line Tracking with TaxiNet

We first analyzed TaxiNet, a perception model for center-line tracking on airport runways [4,9]. It takes runway images as input and produces two outputs, cross-track error (CTE) and heading error (HE), which indicate the lateral and angular distance, respectively, of the nose of the plane from the center-line of the runway. We analyzed a CNN model provided by our industry partner, with 24 layers including three dense layers (100/50/10 neurons) before the output layer. It is critical that the TaxiNet model functions correctly and keeps the plane safe, without running off the taxiway. The domain experts provided a specification for correct output behavior: |y_0 − y_0^ideal| ≤ 1.0 m ∧ |y_1 − y_1^ideal| ≤ 5 degrees. One can evaluate the model correctness using Mean Absolute Error (MAE) on a test set (CTE: 0.366, HE: 1.645).

Feature Elicitation We first need to identify the high-level features that are relevant for the task. These could be some of the simulator parameters (for images generated from a simulator) and/or could be derived from high-level system (natural language) requirements. This is a challenging process requiring several iterations in collaboration with the domain experts. We obtained a list of 10 features: center-line, shadow, skid, position, heading, time-of-day, weather, visibility, intersection (junction), and objects (runway lights, birds, etc.), together with the values of interest for each feature.

Data Analysis and Annotations We manually annotated a subset of 450 images from the test set with values for each feature. An initial data-coverage

Table 1: Rules for TaxiNet. d: annotated dataset; #d: total number of instances for that feature value in d; R_d: recall (%) on d; P_v, R_v: precision (%) and recall (%) on the validation set. Rules with the highest R_d are shown.


Fig. 2: Images satisfying rules for features

analysis of the distribution of the values for every feature across all the images revealed many gaps. For instance, there were only day-time images, with only cloudy weather, and all the images had high visibility. Also, apart from runway lights, there were no images with any other objects on the runway. The analysis already proved useful, providing feedback to the experts regarding the type of images that need to be added to improve the training and testing of the model.

Extracting Feature Rules We invoke Algorithm 2.1 to obtain rules in terms of the values of the neurons at the three dense layers of the network. Note that for each feature, we mined a separate rule for every value of interest. We used half of the annotated set of 450 images for extraction (d in Algorithm 2.1) and the remaining half for validation of the rules. There are multiple rules extracted for each feature; each rule is associated with a support value (the number of instances in d satisfying the rule) and has 100% precision on d, since we only extract pure rules. The results are summarized in Table 1, indicating some high-quality rules (for "center-line present", "shadow present", "light skid", "position left", "position right"), measured on the validation set.

Figure 2 displays some of the images satisfying different rules. The corresponding heatmaps were created by computing the image pixels impacting the neurons in the feature rule [7]. Note that for the "center-line present" rule, the part of the

image impacting the rule (highlighted in red) is the center-line, indicating that indeed the rules identify the feature. On the other hand, in the absence of the center-line, it is unclear what information is used by the model (and the image leads to error). The heatmaps for the shadow and skid also correctly highlight the part of the image with the shadow of the nose and the skid marks. We used such visualization techniques to further validate the rules.

Labeling New Data The rules extracted based on a small set of manually annotated data can be leveraged to annotate a much larger data set. We used the rules for center-line (present/absent) to label all of the test data (2000 images). We chose the rule with the highest R_d for the experiments; however, more rules could be chosen to increase coverage. 1822 of the images satisfied the rule for "center-line present" and 79 images the rule for "center-line absent". We visually checked some of the images to estimate the accuracy of the labeling. We similarly annotated more images for the shadow and skid features. These new labels enable further data-coverage analysis over the train and test datasets.

Feature-Guided Analysis We performed preliminary experiments to demonstrate the potential of feature-guided analyses. We first calculated the model accuracy (MAE) on subsets of the data labeled with the feature present and absent, respectively. We also determined the % of inputs in the respective subsets violating the correctness property. The results are summarized in Table 2.

These results can be used by developers to better understand and debug the model behavior. For instance, the model accuracy computed for the subsets with "shadow present" and "dark skid", respectively, is poor, and a high percentage of the respective inputs violate the correctness property. This information can be used by developers to retrieve more images with shadows and dark skids, to retrain the model and improve its performance. The extracted rules can be leveraged to automate the retrieval.

Furthermore, we observe that in the absence of the center-line feature, the model has difficulty making correct predictions. This is not surprising, as the presence of the center-line can be considered a (rudimentary) input requirement for the center-line tracking application. Indeed, in the absence of the center-line it is hard to envision how the network can correctly estimate the airplane position from the image. The network may use other clues on the runway, leading to errors. We can thus consider the presence of the center-line feature as part of the operational design domain (ODD) for the application. The rules for the center-line feature can be deployed as a run-time monitor to either pass inputs satisfying the rules for "present" or reject those satisfying the rules for "absent", ensuring that the model operates in the safe zone as defined by the ODD, while at the same time increasing its accuracy.
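Such a monitor reduces to evaluating the mined boxes on the layer-l activations of each incoming input. A sketch in Python, where the rule encoding (neuron-index-to-interval maps) and the three-way verdict are our own illustrative assumptions:

```python
def in_box(a, bounds):
    """True if activation vector a satisfies every interval in the box."""
    return all(lo <= a[j] <= hi for j, (lo, hi) in bounds.items())

def monitor(f_l, present_rules, absent_rules, x):
    """Run-time ODD monitor returning 'pass', 'reject', or 'unknown'.

    f_l maps an input to its layer-l activations; the rule lists hold
    box preconditions (neuron index -> interval) mined offline.
    """
    a = f_l(x)
    if any(in_box(a, b) for b in present_rules):
        return "pass"       # center-line present: inside the ODD
    if any(in_box(a, b) for b in absent_rules):
        return "reject"     # center-line absent: outside the ODD
    return "unknown"        # no rule fires; flag for fallback handling
```

The "unknown" verdict reflects the earlier observation that some activation values satisfy no mined rule; a deployment would route such inputs to a conservative fallback.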

We also experimented with generating rules to explain correct and incorrect behavior in terms of combinations of features, such as: (center-line present) ∧


Table 3: Rules for YOLOv4-Tiny (same metrics as in Table 1).

(shadow absent) ∧ (on position) =⇒ correct, and ¬(center-line present) ∧ (heading away) ∧ (position right) =⇒ incorrect.<sup>1</sup> These rules could be further used by developers to better understand and debug the model behavior.

#### 4.2 Object Detection with YOLOv4-Tiny

We conducted another case study with a more challenging network, an object detector, to evaluate the quality of the extracted feature representations. For this study, we use the nuImages dataset, a public large-scale dataset for autonomous driving [1,6]. It contains 93000 images collected while driving around in actual cities. To facilitate computer vision tasks such as object detection for autonomous driving, each image comes labeled with 2D bounding boxes and the corresponding object labels (from one of 23 object classes). Each labeled object also comes with additional attribute annotations. For instance, objects labeled vehicle carry additional annotations like vehicle.moving, vehicle.stopped, and vehicle.parked. Overall, the dataset has 12 categories of additional attribute annotations. We trained a YOLOv4-Tiny object detection model [14,2] on this dataset. YOLOv4-Tiny has 37 layers, including 21 convolutional layers and 2 YOLO layers.

We leveraged the attribute annotations associated with each object as the feature labels (thus no manual labeling was necessary). For extracting feature representations, we ran Algorithm 2.1 on a subset of 2000 images from the nuImages dataset, and then evaluated the extracted representations on a separate validation set of 2000 images.

Table 3 describes our results. We used layer 28 of the YOLOv4-Tiny model to extract the feature representations. For brevity, in Table 3 we only report the number of terms in the rule precondition, i.e., the number of neurons that appear in the constraints, instead of describing the exact rule. Note that layer 28 has 798720 neurons. Strikingly, the extracted rules have only 10 to 25 terms in their preconditions, and yet achieve precision (P_v) between 69% and 74%. The recall (R_v) values are also encouraging, and can be improved further by

<sup>1</sup> The procedure to generate these rules has been omitted for brevity.

considering more than one rule for each feature value (here, we only consider pure rules with the highest recall R_d on the dataset d used for feature extraction).

### 4.3 Challenges and Mitigations

Identifying relevant features is non-trivial, requiring refinement and extensive discussions with domain experts. The feature annotations may need to be provided manually, which is expensive and error-prone. However, we only need a small annotated dataset to extract the representations, which can then be used to annotate further unlabeled data. The extracted rules may be incorrect (e.g., due to unbalanced annotated data). We mitigate this by carefully validating the rules using a separate validation set and visualization techniques. It could also be that the network did not learn some important features. To address this issue, in future work we plan to investigate neuro-symbolic approaches to build networks that are aware of high-level features and satisfy (by construction) the safety requirements.

### 5 Related Work

There is growing interest in developing software engineering approaches for machine learning in general, and neural networks in particular, investigating requirements for neural networks [3], automated testing [16], and debugging and fault localization [8], to name a few. Our work contributes a feature-centric view of neural network behavior that links high-level requirements with the internal logic of the trained models to enable better testing and analysis of neural networks.

A closely related work [18] uses high-level features to guide neural network analysis. However, the features are extracted from input images, not from the internal neural network representation. Further, the work only considers testing, not other software engineering activities.

Our work is also related to concept analysis [17,11,15,13] which seeks to develop explanations of deep neural network behavior in terms of concepts specified by users. We propose to use high-level features for multiple software engineering activities, which go beyond explanations. Moreover, the use of decision tree learning makes our representations relatively cheap to extract. Note that there are other works that use decision tree learning to distill neural network input-output behavior, e.g., [5]; however none of them extract high-level features from the network's internal representation.

### 6 Conclusion

We proposed to extract high-level feature representations related to domain-specific requirements to enable analysis and explanation of neural network behavior. We presented initial empirical evidence in support of our ideas. In future work, we plan to further investigate meaningful requirements for neural networks and

effective techniques for checking them. We also plan to apply Marabou [10] for the verification of safety properties expressed in terms of high-level features. Finally, we plan to investigate neuro-symbolic techniques to develop high-assurance neural network models.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# JavaBIP meets VerCors: Towards the Safety of Concurrent Software Systems in Java<sup>⋆</sup>

Simon Bliudze<sup>1</sup>, Petra van den Bos<sup>2</sup>, Marieke Huisman<sup>2</sup>, Robert Rubbens<sup>2</sup>, and Larisa Safina<sup>1</sup>

<sup>1</sup> Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, 59000 Lille, France {simon.bliudze,larisa.safina}@inria.fr

<sup>2</sup> Formal Methods and Tools, University of Twente, Enschede, The Netherlands {p.vandenbos,m.huisman,r.b.rubbens}@utwente.nl

Abstract. We present "Verified JavaBIP", a tool set for the verification of JavaBIP models. A JavaBIP model is a Java program where classes are considered as components, their behaviour described by finite state machine and synchronization annotations. While JavaBIP guarantees that execution progresses according to the indicated state machines, it does not guarantee properties of the data exchanged between components. It also does not provide verification support to check whether the behaviour of the resulting concurrent program is as (safe as) expected. This paper addresses this by extending the JavaBIP engine with run-time verification support, and by extending the program verifier VerCors to verify JavaBIP models deductively. These two techniques complement each other: feedback from run-time verification allows quicker prototyping of contracts, and deductive verification can reduce the overhead of run-time verification. We demonstrate our approach on the "Solidity Casino" case study, known from the VerifyThis Collaborative Long Term Challenge.

### 1 Introduction

Modern software systems are inherently concurrent: they consist of multiple components that run simultaneously and share access to resources. Component interaction leads to resource contention and, if not coordinated properly, can compromise safety-critical operations. The concurrent nature of such interactions is the root cause of the sheer complexity of the resulting software [9]. Model-based coordination frameworks such as Reo [5] and BIP [6] address this issue by providing models with a formally defined behaviour and verification tools.

JavaBIP [10] is an open-source Java implementation of the BIP coordination mechanism. It separates the application model into component behaviour, modelled as Finite State Machines (FSMs), and glue, which defines the possible stateless interactions among components in terms of synchronisation constraints. The overall behaviour of an application is to be enforced at run time

<sup>⋆</sup> L. Safina and S. Bliudze were partially supported by ANR Investissements d'avenir (ANR-16-IDEX-0004 ULNE) and project NARCO (ANR-21-CE48-0011). P. van den Bos, M. Huisman and R. Rubbens were supported by the NWO VICI 639.023.710 Mercedes project.

by the framework's engine. Unlike BIP, JavaBIP does not provide automatic code generation from the provided model; instead it realises the coordination of existing software components in an exogenous manner, relying on component annotations that provide an abstract view of the software under development.

To model component behaviour, methods of a JavaBIP program are annotated with FSM transitions. These annotated methods model the actions of the program components. Computations are assumed to be terminating and non-blocking. Furthermore, side-effects are assumed to be either represented by the change of the FSM state, or to be irrelevant for the system behaviour. Any correctness argument for the system depends on these assumptions. A limitation of JavaBIP is that it does not guarantee that these assumptions hold. This paper proposes a joint extension of JavaBIP and VerCors [11] providing such guarantees about the implementation statically and at run time.

VerCors [11] is a state-of-the-art deductive verification tool for concurrent programs that uses permission-based separation logic [3]. This logic is an extension of Hoare logic that allows specifying properties using contract annotations. These contract annotations include permissions, pre- and postconditions and loop invariants. VerCors automatically verifies programs with contract annotations. To verify JavaBIP models, we (i) extend JavaBIP annotations with verification annotations, and (ii) adapt VerCors to support JavaBIP annotations. VerCors was chosen for integration with JavaBIP because it supports multithreaded Java, which makes it straightforward to express JavaBIP concepts in its internal representation. To analyze JavaBIP models, VerCors transforms the model with verification annotations into contract annotations, leveraging their structure as specified by the FSM annotations and the glue.

For some programs VerCors requires extra contract annotations. This is generally the case with while statements and recursive methods. To enable properties to be analysed when not all necessary annotations have been added yet, we extend the JavaBIP engine with support for run-time verification. During a run of the program, the verification annotations are checked for that specific program execution at particular points of interest, such as when a JavaBIP component executes a transition. The run-time verification support is set up in such a way that it ignores any verification annotations that were already statically verified, reducing the overhead of run-time verification.

This paper presents the use of deductive and run-time verification to prove assumptions of JavaBIP models. We make the following contributions:


Tool binaries and case study sources are available through the artefact [7].

### 2 Related Work

There are several approaches to analyse behaviours of abstract models in the literature. Bliudze et al. propose an approach allowing verification of infinite state BIP models in the presence of data transfer between components [8]. Abdellatif et al. used the BIP framework to verify Ethereum smart contracts written in Solidity [1]. Mavridou et al. introduce the VeriSolid framework, which generates Solidity code from verified models [13]. André et al. describe a workflow to analyse Kmelia models [4]. They also describe the COSTOTest tool, which runs tests that interact with the model. Thus, these approaches do not consider verification of model implementation, which is what we do with Verified JavaBIP. Only COSTOTest establishes a connection between the model and implementation, but it does not guarantee memory safety or correctness.

There is also previous work on combining deductive and runtime verification. The following discussion is not exhaustive. Generally, these works do not support concurrent Java and JavaBIP. Nimmer et al. infer invariants with Daikon and check them with ESC/Java [14]. However, they do not check against an abstract model, and the results are not used to optimize execution. Bodden et al. and Stulova et al. optimize run-time checks using static analysis [12,16]. However, Stulova et al. do not support state machines, and Bodden et al. do not support data in state machines. The STARVOORS tool by Ahrendt et al. is comparable to Verified JavaBIP [2]. Some minor differences include the type of state machine used, and how Hoare triples are expressed. The major difference is that it is not trivial to support concurrency in STARVOORS. VerCors and Verified JavaBIP use separation logic, which makes concurrency support straightforward.

### 3 JavaBIP and Verifcation Annotations

JavaBIP annotations capture the FSM specification and describe the behaviour of a component. They are attached to classes, methods or method parameters, and were first introduced by Bliudze et al [10]. Listing 1 shows an example of JavaBIP annotations. @ComponentType indicates a class is a JavaBIP component and specifies its initial state. In the example this is the WAITING state. @Port declares a transition label. Method annotations include @Transition, @Guard and @Data. @Transition consists of a port name, start and end states, and optionally a guard. The example transition goes from WAITING to PINGED when the PING port is triggered. The transition has no guard so it may always be taken. @Guard declares a method which indicates if a transition is enabled. @Data either declares a getter method as outgoing data, or a method parameter as incoming data. Note that the example does not specify when ports are activated. This is specified separately from the JavaBIP component as glue [10].

We added component invariants and state predicates to Verified JavaBIP as class annotations. @Invariant(expr) indicates expr must hold after each component state change. @StatePredicate(state, expr) indicates expr must hold in state state. Pre- and postconditions were also added to the @Transition annotation. They have to hold before and after execution of the transition. @Pure


```
@Port(name = PING, type = PortType.enforceable)
@ComponentType(initial = WAITING, name = ECHO_SPEC)
public class Echo {
    @Transition(name = PING, source = WAITING, target = PINGED)
    public void ping() { System.out.println(this + ": pong"); }
}
```
Listing 1. Example of JavaBIP annotations.

Fig. 1. Verified JavaBIP architecture. Ellipse boxes represent analysis or execution.

indicates that a method is side-effect-free, and is used with @Guard and @Data. Annotation arguments should follow the grammar of Java expressions. We do not support lambda expressions, method references, switch expressions, new, instanceof, and wildcard arguments. In addition, as VerCors does not yet support Java features such as generics and inheritance, models that use these cannot be verified. These limitations might be lifted in the future.

### 4 Architecture of Verified JavaBIP

The architecture of Verified JavaBIP is shown in Figure 1. The user starts with a JavaBIP model, optionally with verification annotations. The user then has two choices: verify the model with VerCors, or execute it with the JavaBIP engine.

We extended VerCors to transform the JavaBIP model into the VerCors internal representation, Common Object Language (COL). An example of this transformation is given in Listing 2. If verification succeeds, the JavaBIP model is memory safe, has no data races, and the components respect the properties specified in the verification annotations. In this case, no extra run-time verification is needed. If verification fails, there are either memory safety issues, components violate properties, or the prover timed out. In the first case, the user needs to change the program or annotations and retry verification with VerCors. This is necessary because memory safety properties cannot be checked by the JavaBIP engine, and therefore safe execution of the JavaBIP model is not guaranteed. In the second and third cases, VerCors produces a verification report with the verification result for each property.

We extended the JavaBIP engine with run-time verification support. If a verification report is included with the JavaBIP model, the JavaBIP engine uses it to verify at run time only the verification annotations that were not verified deductively. If no verification report is included, the JavaBIP engine verifies all verification annotations at run time.

```
@Transition(name = PING, source = PING, target = PING, guard = HAS_PING)
public void ping() { pingsLeft--; }
```

```
1 requires PING_state_predicate() && hasPing();
2 ensures PING_state_predicate();
3 public void ping() { pingsLeft--; }
```
Listing 2. Top: example of a transition in JavaBIP. Bottom: internal representation of ping after encoding JavaBIP semantics.

### 5 Implementation of Verified JavaBIP

This section briefly discusses relevant implementation details of Verified JavaBIP.

Run-time verification in the JavaBIP engine is performed by checking component properties after component construction, and before and after transitions. For example, before the JavaBIP engine executes a transition, it checks the component invariant, the state predicate, and the precondition of the transition. When a property is violated, either execution is terminated or a warning is printed, depending on how the user configured the JavaBIP engine. We expect run-time verification performance to scale linearly, as properties can be checked individually. We have not measured the impact of the use of reflection in the JavaBIP engine.

For deductive verification the JavaBIP semantics is encoded into COL. We describe this with an example. The top part of Listing 2 shows the ping method, where @Transition indicates a transition from PING to PING. The guard indicates that the transition is allowed if there is a ping. HAS_PING refers to a method annotated with @Guard(name=HAS_PING), which returns pingsLeft >= 1.

The bottom part of Listing 2 shows the COL representation of the ping method after encoding the JavaBIP semantics. Line 1 states the precondition, line 2 the postcondition. PING_state_predicate() refers to the PING state predicate, which constrains the values of the class fields. By default it is just true. Since the predicate is both a pre- and a postcondition, it is assumed at the start of the method, and needs to hold at the end of the method. hasPing() is the method with the @Guard annotation for the HAS_PING label. The method is called directly in the COL representation. We have implemented such a transformation from JavaBIP to COL for each JavaBIP construct.

To prove memory safety, we extended VerCors to generate permissions. This ensures verification in accordance with the Java memory model. Currently, each component owns the data of all its fields. This works for JavaBIP models that do not share data between components. For other models, a different approach might be necessary, e.g. VerCors taking into account permission annotations provided by the user. For more information about permissions, we refer the reader to [3].

Finally, scalability of deductive verification of JavaBIP models could be a point of future work, as the number of proof obligations scales quadratically in the number of candidate transitions of a synchronization.

# 6 VerifyThis Casino and Verified JavaBIP

We illustrate Verified JavaBIP with the Casino case study adapted from [17]. We discuss the case study and its verification. The case study sources and the Verified JavaBIP sources and binaries are included in the artefact [7].

The model uses three component types: player, operator, and casino. The model supports multiple players and casinos, but each casino has only one operator. Players bet on the result of a coin flip. The casino pays out twice for a correct guess, and keeps the money otherwise. The casino contains the pot balance and money reserved for the current bet. The operator can add to or withdraw money from the casino pot based on a local copy of the casino pot.

We have added several invariants to this model. The purse of every player, the casino pot, its operator copy, the wallet of the operator, and the placed bet must all be non-negative, as the model does not support debts. If no bet is placed, it must be zero. These properties are defined as @Invariant or @StatePredicate annotations on the components in the model.

One problem with the model is that the player can win more than the casino pot contains, because there are no restrictions on how much the player can bet. The problem is detected by both deductive and run-time verification. VerCors cannot prove that the casino pot is non-negative after the PLAYER_WIN transition, which is part of the casino invariant. The JavaBIP engine can detect the problem, but is not guaranteed to, because the model has some non-determinism: for example, if no player ever wins, the problem is not detected by run-time verification.

There are several solutions. First, the user can choose to always enable run-time verification, such that the execution is always safe. This might be acceptable depending on the performance penalty of run-time verification. Second, guards can be added to restrict model behaviour. For example, PLACE_BET could require bet <= pot. However, in general, adding guards might introduce deadlocks. Third, the model can be refactored to avoid the problem. For example, the casino could limit how much the player can bet. This introduces no extra run-time checks; however, in general the behaviour of the model will change.

## 7 Conclusions and Future Work

We presented Verified JavaBIP, a tool set for verifying the assumptions of JavaBIP models and their implementations. The tool set extends the original JavaBIP annotations for verification of functional properties. Verified JavaBIP supports deductive verification using VerCors, and run-time verification using the JavaBIP engine. Only properties that could not be verified deductively are checked at run time. In the demonstration, we automatically detect a problem in the Casino case study using Verified JavaBIP.

There are several directions for future work. First, support for checking memory safety could be extended by supporting data sharing between components. Second, we want to investigate run-time verification of memory safety. Third, more experimental evaluation can be done on the capabilities and performance of Verified JavaBIP. Fourth and finally, we want to investigate run-time verification of safety properties of the JavaBIP model beyond invariants.

### References





# Model-based Player Experience Testing with Emotion Pattern Verification

Saba Gholizadeh Ansari<sup>1</sup>, I. S. W. B. Prasetya<sup>1</sup>, Davide Prandi<sup>2</sup>, Fitsum Meshesha Kifetew<sup>2</sup>, Mehdi Dastani<sup>1</sup>, Frank Dignum<sup>3</sup>, and Gabriele Keller<sup>1</sup>

<sup>1</sup> Utrecht University, Utrecht, The Netherlands, s.gholizadehansari@uu.nl <sup>2</sup> Fondazione Bruno Kessler, Trento, Italy <sup>3</sup> Umeå University, Umeå, Sweden

Abstract. Player eXperience (PX) testing has attracted attention in the game industry as video games become more complex and widespread. Understanding players' desires and their experience is a key element in guaranteeing the success of a game in a highly competitive market. Although a number of techniques have been introduced to measure the emotional aspect of the experience, automated testing of player experience still needs to be explored. This paper presents a framework for automated player experience testing by formulating emotion pattern requirements and utilizing a computational model of players' emotions, developed based on a psychological theory of emotions, along with a model-based testing approach for test suite generation. We evaluate the strength of our framework by performing mutation testing. The paper also evaluates the performance of a search-based generated test suite and an LTL model checking-based test suite in revealing various variations of temporal and spatial emotion patterns. Results show the contribution of both algorithms in generating complementary test cases for revealing various emotions in different locations of a game level.

Keywords: automated player experience testing, agent-based testing, model-based testing, models of emotion

# 1 Introduction

Player experience (PX) testing has become an increasingly critical aspect of game development, assisting game designers in realistically anticipating the experience of game players in terms of enjoyment [17], flow [46] and engagement [31]. While functional testing is intended to test the functionality of the game [38], PX testing verifies whether the emotions and psychology of players, shaped during interaction with the game, are close to the design intention. This helps game designers in early development stages to identify design issues leading to game abandonment, improve the general experience of players and even evoke certain experiences during game-play [53,3,25]. Let us also clarify that 'usability' is a concept in the broad domain of PX testing, but not the only one. Usability tests are designed to address issues that can degrade human performance during game-play [10], whereas PX testing can target the emotional experience of a player, which ultimately influences the success or failure of a game in the market [1]. This has led to the emergence of Games User Research (GUR) as an approach to gain insights into PX; it is tied to human-computer interaction, human factors, psychology and game development [14].

Validating a game design relies either on trained PX testers or on acquiring information directly from players with methods such as interviews, questionnaires and physiological measurements [40,37], which are labour-intensive, costly and do not necessarily represent all user profiles and their emotions. Moreover, such tests need to be repeated after every design change to ensure the PX is still aligned with the design intention. Thus, GUR researchers have turned to developing AI-based PX testing methods. In particular, agent-based testing has attracted attention because it opens new avenues for automated PX testing by imitating players, while keeping the cost of labour and of re-applying the tests low.

Appraisal theories of emotions address the elicitation of emotions and their impact on emotional responses. They state that emotions are elicited by an appraisal evaluation of events and situations [33]. The Ortony, Clore, and Collins (OCC) theory [43] is one of several widely known appraisal theories in cognitive science and is also commonly used for modeling emotional agents [15,9,47,42,12]. Despite the influence of emotions on forming the experience of players [39,13], this approach has not been employed in PX testing [6].

In our automated PX testing approach, we opt for a model-driven approach to modeling emotions. Theoretical models of human cognition, used for decades in cognitive psychology, provide a more coherent outlook on cognitive processes. In contrast, applying a data-driven (machine learning) approach is greatly constrained by the availability of experimental data. Inferring a cognitive process from limited experimental data is an ill-posed problem [5] because such a process is subjective. Individuals can evaluate the same event differently due to age, gender, education, cultural traits, etc. For example, when a romantic relationship ends, some individuals feel sadness, others anger, and some even experience relief [48]. However, according to appraisal theories of emotions, common patterns can be found in the emergence of the same emotion. These patterns are given as a structure of emotions by the aforementioned OCC theory. Thus, a model-driven approach derived from a well-grounded theory of emotions such as OCC is sensible when access to sufficient data is not possible.

In this paper, we present an agent-based player experience testing framework that allows designers to express emotional requirements as patterns and to verify them on executed test suites generated by a model-based testing (MBT) approach. The framework uses a computational model of emotions based on the OCC theory [21] to generate the emotional experience of agent players. Compared to [21], this paper contributes the expression of emotion pattern requirements and the generation of covering test suites for verifying patterns on a game level. We show that such a framework allows game designers to verify emotion pattern requirements and to gain insight into the emotions the game induces, over time and over space.

Revealing such patterns requires a test suite that can trigger enough diversity in the game behavior and, as a result, in the evoked emotions. This is where the model-based testing approach, with its fast test suite generation, can contribute. In this paper, we employ an extended finite state machine (EFSM) model [18] that captures all possible game-play behaviors, serving as a subset of human behaviors at some level of abstraction. We use a search-based (SB) algorithm for testing, more precisely the multi-objective search algorithm MOSA [44], and linear temporal logic (LTL) model checking (MC) [8,11] as two model-based test suite generation techniques, and investigate the ability of each generated test suite to reveal variations of emotions, e.g., the absence of an emotion in a corridor. We apply a test-case distance metric to measure the test suites' diversity and the distance between the SB and MC test suites. Results on our 3D game case study show that SB and MC, due to their different test generation techniques, produce distinctive test cases which can identify different variations of emotions over space and time that cannot be identified by just one of the test suites.

The remainder of this paper is organized as follows. Section 2 explains the computational model of emotions and the model-based testing approach. Section 3 presents the PX framework architecture. Section 4 describes our methodology for expressing PX requirements using emotion patterns, measuring test suite diversity, and the overall PX testing algorithm. Section 5 presents an exploratory case study demonstrating emotion pattern verification using model-based testing, along with an investigation of the results of the SB and MC test suite generation techniques. Mutation testing is also addressed in this section to evaluate the strength of the proposed approach. Section 6 gives an overview of related work. Finally, Section 7 proposes future work and concludes the paper.

### 2 Preliminaries

This section summarizes the OCC computational model of emotions [21] and model-based testing, the key components of our PX framework.

#### 2.1 Computational Model of Emotions

Gholizadeh Ansari et al. [21] introduce a transition system to model goal-oriented emotions based on a cognitive theory of emotions called OCC. The OCC theory gives a structure for 22 emotion types, viewed as cognitive processes, where each emotion type is elicited under certain conditions. The structure is constructed based on appraisal theory, which has been validated with a series of experiments in psychology [50,16,49]. The appraisal conditions from the OCC theory are modeled formally in [21] for six goal-oriented emotion types (ety), namely hope, joy, satisfaction, fear, distress, and disappointment, for single-agent simulations where the agent's emotional state changes only through game dynamism, expressed through events sent to the agent. A game is treated as an environment that discretely produces events triggered by the agent's actions or by environmental dynamism such as hazards. The event tick represents the passage of time. The emotion model of an agent is defined as a 7-tuple transition system M:

(S, s0, G, E, δ, Des, Thres)

- K is a set of propositions the agent believes to be true. It includes, for each goal g, a proposition status(g, p) indicating whether g has been achieved or failed, and a proposition P(g, v) with v ∈ [0..1], stating the agent's current belief on the likelihood of reaching this goal.
- Emo is the agent's emotional state, represented by a set of active emotions. Each is a tuple ⟨ety, g, w, t0⟩, where ety is the emotion type, w is the intensity of the emotion with respect to a goal g, and t0 is the time it was triggered.

The transition function δ updates the agent's state ⟨K, Emo⟩, triggered by an incoming event e ∈ E, as follows:

$$\langle K, Emo \rangle \xrightarrow{\ e\ } \langle K',\ \overbrace{newEmo(K, e, G) \oplus decayed(Emo)}^{\text{updated emotion } Emo'} \rangle$$


Emotion activation. One or multiple emotions can be activated by an incoming event (except tick). This is formulated as follows:

$$newEmo(K, e, G) = \{ \langle ety, g, w, t \rangle \mid ety \in Etype,\ g \in G,\ w = \mathcal{E}_{ety}(K, e, g) > 0 \} \tag{1}$$

where w is the intensity of the emotion ety towards the goal g ∈ G and t is the current system time. Upon an incoming event, the above function is called to check for the occurrence of new emotions as well as the re-stimulation of existing emotions in Emo, for every g ∈ G. E_ety(K, e, g) internally calculates an activation potential value and compares it to a threshold Thres_ety; a new emotion is only triggered if the activation potential exceeds the threshold. These thresholds might vary according to players' characters and their moods. For instance, when a person is in a good mood, their threshold for activating negative emotions goes up, meaning they become more tolerant before feeling negative-valenced emotions. There is also a memory (emhistory) of emotions activated in the past within some reasonable time frame; it is maintained implicitly in the emotions' activation functions. The activation function of each emotion, based on the definitions provided in the OCC theory, is as follows, where x, v and v′ refer to the goal's importance and the goal's likelihood in the previous and the new state, respectively.

$$\mathcal{E}_{Hope}(K, e, g) = \overbrace{v' \cdot x}^{\text{activation intensity}} - Thres_{Hope}$$

provided g = ⟨id, x⟩ ∈ G, P(g, v) ∈ K, P(g, v′) ∈ e(K), and v < v′ < 1.
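As an illustration, the hope activation can be sketched in Python. This is a hypothetical rendering: the parameter names and the clipping to zero are our assumptions, not code from [21].

```python
def hope_activation(x, v_old, v_new, thres_hope):
    """Sketch of E_Hope(K, e, g): hope is stimulated when an event raises
    the believed likelihood of reaching goal g (v_old < v_new < 1).
    x is the goal's importance; the result is the emotion intensity w."""
    if not (v_old < v_new < 1):
        return 0.0  # appraisal condition not met: no hope stimulated
    w = v_new * x - thres_hope  # activation intensity minus threshold
    return w if w > 0 else 0.0  # newEmo only keeps emotions with w > 0
```

For example, for a maximally important goal (x = 1.0) whose likelihood rises from 0.5 to 0.8 under a threshold of 0.3, the stimulated intensity is roughly 0.5.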


Emotion decay. An emotion's intensity in Emo declines over time, triggered by tick events. This is formulated with a decay function over the intensity as follows:

$$decayed(Emo) = \{ \langle ety, g, w', t_0 \rangle \mid \langle ety, g, w, t_0 \rangle \in Emo,\ w' = \mathsf{decay}_{ety}(w_0, t_0) > 0 \} \tag{2}$$

where w0 is the initial intensity of ety for the goal g at time t0; this is stored in emhistory. decay_ety is a decay function defined as an inverse exponential function over the peak of the intensity (w0) at time t0.
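A minimal sketch of such a decay function follows. The decay rate and the exact exponential form are our assumptions; the paper only states that the decay is an inverse exponential over the peak intensity.

```python
import math

def decay(w0, t0, t_now, rate=0.5):
    """Inverse-exponential decay of an emotion from its peak intensity w0,
    reached at trigger time t0, evaluated at the current time t_now.
    `rate` is an assumed decay parameter."""
    return w0 * math.exp(-rate * (t_now - t0))
```

In Equation 2, an emotion is dropped from Emo once its decayed intensity is no longer positive; since this curve only approaches zero asymptotically, an implementation would additionally apply a small cut-off threshold.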

### 2.2 Model-based Testing with EFSM

Since automated testing is a major challenge for the game industry, due to the complexity and vastness of games' interaction spaces, a recent development is to apply a model-based approach for test generation [30,52,18]. For this purpose, an extended finite state machine (EFSM) M can be used, which is a finite state machine (FSM) extended with a set V of context variables that allows the machine to have richer concrete states than the abstract states of its base FSM [2]. Transitions t in M take the form $n \xrightarrow{\,l/g/\alpha\,} n'$, where n and n′ are the source and destination abstract states of the transition, l is a label, g is a predicate over V that guards the transition, and α is a function that updates the variables in V.

Figure 1 shows an example of a small level in a game called Lab Recruits<sup>4</sup>, which is also the case study of this paper. A Lab Recruits level is a maze with a set of rooms and interactive objects, such as doors and buttons. A level might also contain fire hazards. The player's goal is to reach the object gf0. Access to it is guarded by door3, so reaching it involves opening the door using a button, which in turn is in a different room, guarded by another door, and so on. Ferdous et al. [18] employ combined search-based and model-based testing for functional bug detection in this game using an EFSM model (Figure 1). In the model, all interactable objects are EFSM states: the doors (3), the buttons (4), and the goal object gf0. For each door i, two states dip and dim are introduced to model the two sides of the door. The model has three context variables representing the state of each door (open/closed). A solid-edged transition in the model is unguarded, modelling the agent's trip from one object to another without walking through a door. A dotted transition models traversing through a door when the door is open. A dashed self-loop transition models pressing a button; it toggles the status of the doors connected to the pressed button. Notice that the model captures the logical behavior of the game. It abstracts away the physical shape of the level, which would otherwise make the model more complicated and prone to changes during development. Given such a model, abstract test cases are constructed as sequences of consecutive transitions in the model. This paper extends the EFSM model-based testing approach [18] to player experience testing.
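To make the EFSM ingredients concrete, here is a minimal, hypothetical Python sketch of one button/door fragment. The state names (`button1`, `d1p`, `d1m`) and transition labels are illustrative assumptions, not the actual model of Figure 1.

```python
class EFSM:
    """A tiny EFSM: abstract states plus context variables V (door states)."""
    def __init__(self, initial, ctx):
        self.state, self.ctx = initial, dict(ctx)
        self.transitions = []  # entries: (src, label, guard, update, dst)

    def add(self, src, label, dst, guard=None, update=None):
        self.transitions.append((src, label, guard, update, dst))

    def fire(self, label):
        """Take the first enabled transition with this label, if any."""
        for src, lab, guard, update, dst in self.transitions:
            if src == self.state and lab == label and (guard is None or guard(self.ctx)):
                if update:
                    update(self.ctx)
                self.state = dst
                return True
        return False

m = EFSM("button1", {"door1": False})
# dashed self-loop: pressing the button toggles its connected door
m.add("button1", "press", "button1",
      update=lambda c: c.update(door1=not c["door1"]))
# solid edge: unguarded travel to one side of door1
m.add("button1", "goto_d1p", "d1p")
# dotted edge: traversing door1 is guarded on the door being open
m.add("d1p", "through_d1", "d1m", guard=lambda c: c["door1"])
```

An abstract test case is then a label sequence such as `["press", "goto_d1p", "through_d1"]`, executable only if every guard holds along the way.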

Fig. 1: A game level in the Lab Recruits game and its EFSM model [18].

# 3 PX Testing Framework

The proposed automated PX testing framework aims to aid game designers in the PX assessment of their games by providing information on the time and place of emerged emotions and their patterns, which ultimately determine the general experience of the player. For example, if these patterns do not fulfill design intentions, game properties can be altered and the testing process can be repeated.

<sup>4</sup> https://github.com/iv4xr-project/labrecruits

Figure 2 shows the general architecture of the framework. There are four main components: a Model-based Testing component for generating tests, a Model of Emotions component that implements the computational model of emotions from Section 2.1, an Aplib basic test agent [45] for controlling the in-game player character, and the PX Testing Tool as an interface for a game designer towards the framework. The designer needs to provide as inputs (1 in Figure 2) an EFSM model of the game under test and a characterization of the player that configures the model of emotions (see Section 5.1).


Given the EFSM model, the Model-based Testing component (2 in Figure 2) generates a test suite consisting of abstract test cases to be executed on the game under test (GUT). The test generation approach is explained in Section 4.1. Due to the abstraction of the model, emotion traces cannot be obtained from pure on-model executions; they require the execution of the test cases on the GUT. An adapter is needed to convert the abstract test cases into actual instructions for the GUT. The Aplib basic test agent performs this conversion.

Attaching the Model of Emotions to the basic test agent creates an emotional test agent (3 in Figure 2), which is able to simulate emotions based on incoming events. Via a plugin, the emotional test agent is connected to the GUT. Each test case of the test suite is then given to the agent for execution. The agent computes its emotional state upon observing events and records it in a trace file. Finally, when the whole test suite has been executed, the PX Testing Tool analyzes the traces to verify the given emotional requirements and to provide heat-maps and timeline graphs of emotions for the given level (4 in Figure 2).

Fig. 2: Automated PX testing framework architecture.

# 4 Methodology

This section describes the framework's model-based test generation techniques and our approach to measuring a test suite's diversity. Then, our approach for expressing emotion pattern requirements and verifying them is explained.

### 4.1 Test Suite Generation

A test generation algorithm is applied to produce abstract test cases from a model with respect to a given coverage criterion. From now on, we refer to these abstract test cases simply as test cases. In our context, game designers can evaluate the game experience by evaluating the emotional experience emerging along various paths to the game's goal. So, a proper test suite needs to cover many variations of player behavior to expose various emotion patterns. Here, we aim at graph-based coverage, such as transition coverage. However, since the model of emotions from Section 2.1 is goal-oriented, some adjustment is needed:

Definition 1. Transition-goal coverage over an EFSM model M with respect to a goal state g is a requirement to cover all transitions in M, where a transition t is covered by a test case if its execution passes t and terminates in g.

Given the above definition, the PX framework uses the following complementary test generation approaches: one is stochastic and the other is deterministic.

Search-based test generation Search-based testing (SBT) formulates a testing problem as an optimization problem, in which a search algorithm is used to find an optimized solution, in the form of a test suite, that satisfies a given test adequacy criterion encoded as a fitness function [36]. Meta-heuristic algorithms such as genetic algorithms [23] and tabu search [22,26] are commonly used for this. Our framework uses an open source library, EvoMBT [18], that comes with several state-of-the-art search algorithms, e.g., MOSA [44]. We utilize this to produce a test suite satisfying, e.g., the criterion in Def. 1, to represent players' potential behavior in the game; these test cases are then executed to simulate the players' emotional experience.

To apply MOSA, an individual encoding, search operators and a fitness function need to be defined. An individual I is represented as a sequence of EFSM transitions. Standard crossover and mutation are used as the search operators. MOSA treats each coverage target as an independent optimisation objective. For each transition t, the fitness function measures how much of an individual I is actually executable on the model and how close it is to covering t as in Def. 1. MOSA then evolves a population that minimizes the distances to all the targets.
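The per-target fitness described above can be sketched as follows. This is a hypothetical rendering: the `simulate` interface and the exact normalization are our assumptions, and MOSA's real fitness in EvoMBT may differ.

```python
def transition_fitness(individual, simulate, target, goal):
    """Per-target fitness sketch for MOSA-style search. `individual` is a
    sequence of EFSM transition labels; simulate(individual) is an assumed
    model interface returning (executed_prefix, final_state). Fitness is 0
    when the target transition is executed and the run ends in the goal
    state (Def. 1); otherwise it grows with the non-executable portion."""
    executed, final = simulate(individual)
    if target in executed and final == goal:
        return 0.0  # target covered as required by Def. 1
    unexecuted = len(individual) - len(executed)
    return 1.0 + unexecuted / (len(individual) + 1)
```

MOSA evolves the population against all such objectives at once, keeping the best individual found so far for each target.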

LTL model checking test generation Model checking is the second technique we use for test generation. It was originally introduced for automated software verification: it takes a finite state model of a program as input and checks whether given specifications hold in the model [8]. Such specifications can be formulated in, e.g., LTL, which is a powerful language for expressing system properties over time. When the target formula is violated, a model checker produces a counter-example in the form of an execution trace to help debug the model. This ability is exploited for producing test cases by encoding coverage targets as negated formulas and converting the produced counter-examples into test cases [4,11,20]. We use this to generate test suites satisfying the coverage criterion in Def. 1, encoded as LTL properties. For each transition t : n1 → n2 in the EFSM model, the transition-goal coverage requirement to cover t is encoded as the following LTL formula:

$$\phi_t = \neg g\ \mathcal{U}\ \big( n_1 \wedge \mathcal{X}(n_2 \wedge \neg g\ \mathcal{U}\ g) \big),$$

where g is the goal state, like gf0 in Figure 1. The model checking algorithm checks whether ¬φt is valid on the EFSM model using depth-first traversal [29]. If it is not, a counter-example is produced that visits t and terminates in g. An extra iteration is added to find the shortest covering test case.
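For illustration, building the textual form of φ_t for a concrete transition is straightforward. The ASCII operator notation below (`U`, `X`, `&&`, `!`) is one common convention and an assumption on our part; the model checker actually used may expect a different concrete syntax.

```python
def transition_goal_ltl(n1, n2, goal):
    """Encode the transition-goal coverage target for t : n1 -> n2 as the
    LTL formula  phi_t = !g U (n1 && X(n2 && (!g U g)))."""
    return f"!{goal} U ({n1} && X({n2} && (!{goal} U {goal})))"
```

The model checker is then asked to verify the negation of this formula; any counter-example it returns is a run that takes t and ends in the goal state, which is converted into a test case.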

#### 4.2 Test Suite Diversity

Diversity is an approach to measure the degree of variety of the control and data flow in software or a game [41]. We use this approach to measure the diversity of the test suites obtained from the generators in Section 4.1. A test suite's diversity degree is the average distance between every pair of distinct test cases, which can be measured with, e.g., the Jaro distance metric. For a test case tc, let tc and |tc| denote its string representation and its length, respectively. The Jaro distance between two test cases tc_i and tc_j is calculated as follows:

$$Dis\_Jaro(tc_i, tc_j) = \begin{cases} 1 & \text{if } m = 0 \\[4pt] 1 - \frac{1}{3}\left( \frac{m}{|tc_i|} + \frac{m}{|tc_j|} + \frac{m-t}{m} \right) & \text{if } m \neq 0 \end{cases} \tag{3}$$

where m is the number of matching symbols in the two strings whose positional distance is less than ⌊|tc_i|/2⌋, assuming tc_i is the longer string, and t is half the number of transpositions. Then, the diversity of a test suite TS is the summation of the distances between every pair of distinct test cases, divided by the number of such pairs:

$$Div_{avg}(TS) = \frac{\sum_{i=1}^{|TS|} \sum_{j=i+1}^{|TS|} Dis\_Jaro(tc_i, tc_j)}{|TS| \cdot (|TS|-1)/2} \tag{4}$$

where |TS| is the size of TS. Additionally, if TS1 and TS2 are two test suites, the average distance between them is:

$$Dis\_avg(TS_1, TS_2) = \frac{\sum_{tc_i \in TS_1,\ tc_j \in TS_2} Dis\_Jaro(tc_i, tc_j)}{|TS_1| \cdot |TS_2|} \tag{5}$$

This is used in Section 5 to measure the distance between the test suites generated by the two approaches (Section 4.1) provided by our framework, along with their complementary effects on revealing different emotion patterns.
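The two metrics can be sketched directly from Equations 3 and 4. Note one assumption in this rendering: the matching window ⌊|tc_i|/2⌋ follows the paper's description rather than the classic Jaro window of ⌊max/2⌋ − 1.

```python
def jaro_distance(a, b):
    """Jaro distance (Equation 3) between two test-case strings."""
    window = max(len(a), len(b)) // 2              # paper's matching window
    a_used, b_used = [False] * len(a), [False] * len(b)
    m = 0
    for i, c in enumerate(a):                      # count matching symbols m
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_used[j] and b[j] == c:
                a_used[i] = b_used[j] = True
                m += 1
                break
    if m == 0:
        return 1.0
    a_m = [c for i, c in enumerate(a) if a_used[i]]
    b_m = [c for j, c in enumerate(b) if b_used[j]]
    t = sum(x != y for x, y in zip(a_m, b_m)) / 2  # half the transpositions
    return 1 - (m / len(a) + m / len(b) + (m - t) / m) / 3

def diversity(ts):
    """Average pairwise Jaro distance of a test suite (Equation 4)."""
    pairs = [(ts[i], ts[j]) for i in range(len(ts)) for j in range(i + 1, len(ts))]
    return sum(jaro_distance(a, b) for a, b in pairs) / len(pairs)
```

For example, `jaro_distance("MARTHA", "MARHTA")` is 1/18 ≈ 0.056 (one transposition), while two identical test cases are at distance 0.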

### 4.3 Emotion Patterns' Requirements and Heat-maps

In Section 2.1, we described the emotion model of an agent. When the agent executes a test case, it produces a trace of its emotional state over time. Such a trace is a sequence of tuples (t, p, Emo), where t is a timestamp, Emo is the agent's emotional state at time t, and p is its position. Running a test suite produces a set of such traces. We define emotion patterns to capture the presence or absence of an emotional experience in a game. Such a pattern is expressed by a string of symbols, each representing the stimulation, or lack of stimulation, of a certain emotion type.

Definition 2. An emotion pattern is a sequence of stimulations e or ¬e, where e is one of the symbols H, J, S, F, D and P. Each represents the stimulation of respectively hope, joy, satisfaction, fear, distress, and disappointment.

A single pattern such as F represents the stimulation of the corresponding emotion, in this case fear. We restrict ourselves to simply meaning that this stimulation occurs, without specifying, e.g., when exactly it happens, nor for how long it is sustained. A negative single pattern such as ¬F represents the absence of stimulation, in this case of fear. A pattern is a sequence of one or more single patterns, specifying in what order the phenomena the single patterns describe are expected to occur. Patterns provide a simple, intuitive, but reasonably expressive way to express PX. For example, the pattern JFS is satisfied by traces where the agent at some point becomes satisfied (S) after a stimulation of joy (J), but in between it also experiences a stimulation of fear at least once. Another example is J¬FS, where there is no stimulation of fear between J and S. The presence of this pattern indicates the presence of a 'sneak' route, where a goal is achievable without the player having to fight enough for it.

As part of the PX requirements, developers might insist on the presence or absence of certain patterns. More precisely, given a pattern p, we can pose these types of requirements: Sat(p) requires that at least one execution of the game under test satisfies p; UnSat(p) requires that Sat(p) does not hold; and Valid(p) requires that all executions satisfy p. In the context of testing, we judge this by the executions of the test cases in the given test suite TS.
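These three requirement types can be checked mechanically over the recorded traces. Below is a hypothetical Python sketch: a trace is simplified to a list of sets of emotion symbols stimulated at each step, `'!F'` encodes ¬F, and for simplicity every negative single pattern is assumed to be followed by a positive one.

```python
def check(trace, pattern, i=0):
    """Does the trace, from step i on, satisfy the emotion pattern?"""
    if not pattern:
        return True
    p = pattern[0]
    if not p.startswith("!"):
        # positive single pattern: the emotion must be stimulated at some
        # later step, with the rest of the pattern matching afterwards
        return any(p in trace[j] and check(trace, pattern[1:], j + 1)
                   for j in range(i, len(trace)))
    # negative single pattern: the forbidden emotion must not appear
    # before the next positive single pattern is matched
    forbidden, nxt, rest = p[1:], pattern[1], pattern[2:]
    for j in range(i, len(trace)):
        if nxt in trace[j] and check(trace, rest, j + 1):
            return True
        if forbidden in trace[j]:
            return False
    return False

def sat(traces, pattern):    # Sat(p): some execution satisfies p
    return any(check(t, pattern) for t in traces)

def unsat(traces, pattern):  # UnSat(p): no execution satisfies p
    return not sat(traces, pattern)

def valid(traces, pattern):  # Valid(p): every execution satisfies p
    return all(check(t, pattern) for t in traces)
```

For instance, on the trace `[{'J'}, {'F'}, {'S'}]` the pattern `['J', 'F', 'S']` holds while `['J', '!F', 'S']` does not, exposing that this run had to pass through fear.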

Heat-maps Whereas above we discussed emotion patterns over time, a heat-map shows patterns over space. Assuming the visitable parts of a game level form a 2D surface, we can divide it into small squares of size u×u. Given a position p and a square s, we can check whether p ∈ s. Given a trace τ, let Emo(s) = {Emo | (t, p, Emo) ∈ τ, p ∈ s} be the set of emotional states that occur in the square s. This set can be aggregated by a function aggr that maps Emo(s) to R. An example of an aggregator is the function max_e that calculates the maximum of a specific emotion e (e.g., hope). Section 5 will show some examples. Such maps can be analyzed against requirements, e.g., that the aggregate values in certain areas should be of a certain intensity. We can also create an aggregated heat-map of an entire test suite by merging the traces of its test cases into a single trace, and then calculating the map from the combined trace. Finally, the overall methodology of our PX testing is summarized in Algorithm 1.
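The spatial aggregation can be sketched as follows. The trace representation (tuples (t, (x, y), emo), where emo maps emotion symbols to intensities) is an assumption; the cell size u and the max_e aggregator follow the description above.

```python
from collections import defaultdict

def heat_map(trace, u, aggr):
    """Bucket trace entries into u-by-u squares and aggregate per square."""
    cells = defaultdict(list)
    for t, (x, y), emo in trace:
        cells[(int(x // u), int(y // u))].append(emo)
    return {cell: aggr(emos) for cell, emos in cells.items()}

def max_e(e):
    """Aggregator max_e: peak intensity of emotion symbol e in a square."""
    return lambda emos: max(em.get(e, 0.0) for em in emos)
```

A test-suite-level map is obtained by concatenating all traces of the suite before calling `heat_map`.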

#### 4.4 PX Framework Implementation

The test agent is implemented using the APlib Java library [45]. It has a BDI architecture [27] with a novel goal and tactical programming layer. We use the JOCC library [21] for modeling emotions. To facilitate the model-based testing, we integrate EvoMBT [18]. It generates abstract test suites from an EFSM model, utilizing EvoSuite [19] for search-based test generation. An implementation of an LTL model checking algorithm is employed to produce the model checking-based test suites. The framework and its data will be available for public use.

```
Algorithm 1 The automated PX testing algorithm.
```

```
Input:  EFSM M, coverage criterion C,
        configuration parameters Config for the test generator,
        and a list R of emotion pattern requirements.
Output: emotion traces, heat-maps of emotions,
        and the verification result of each requirement (true/false).
 1: procedure Exec(M, C, Config, R)
 2:   TS_abstract ← TSGenerate(M, C, Config)
 3:   TS_concrete ← Translate(TS_abstract)
 4:   Configure an emotional test agent A
 5:   traces_emotion ← ∅
 6:   for all test cases tc ∈ TS_concrete do
 7:     τ ← A executes tc on the SUT
 8:     traces_emotion ← traces_emotion ∪ {τ}
 9:   end for
10:   Hmaps ← GenerateHeatMaps(traces_emotion)
11:   Vresults ← { (r, Verify(r, traces_emotion)) | r ∈ R }
12:   return (traces_emotion, Hmaps, Vresults)
13: end procedure
```
### 5 Case Study

This section presents an exploratory case study conducted to investigate the use of our model-based PX testing framework<sup>5</sup> for verifying emotion requirements in a game level, and to investigate the difference between the search-based and model-checking-generated test suites in revealing emotion patterns. Finally, we run mutation testing to evaluate the strength of our framework.

#### 5.1 Experiment Configuration

Figure 3 shows a test level called Wave-the-flag in Lab Recruits, a configurable 3D game designed for AI researchers to define their own testing problems. It is a medium-sized level, consisting of a 1182 m<sup>2</sup> navigable virtual floor, 8 rooms, 12 buttons, and 11 doors. Its EFSM model consists of 35 states and 159 transitions. The player starts in the room marked green at the top, and must find a 'goal flag' gf0, marked red in the bottom room, to finish the level. Doors and buttons form a puzzle in the game. A human player needs to discover the connections between buttons and doors to open a path through the maze and reach the aforementioned goal flag in a timely manner.

<sup>5</sup> https://doi.org/10.5281/zenodo.7506758

Fig. 3: Wave-the-flag level.

The player can earn points by opening doors, and loses health when passing fire flames. For the test agent, the latter is also observable, as an event called Ouch. If the player runs out of health, it loses the game. The player has no prior knowledge about the positions of doors, buttons and the goal flag, nor knowledge of which buttons open which doors. Since there are multiple paths to the target, depending on the path the player chooses to explore, it might reach the goal without health loss, at one end of the spectrum, or end up dead at the other. The EFSM model (not shown) of the Wave-the-flag level is constructed similarly to the running example in Section 2.2. To add excitement, Wave-the-flag also contains fire flames. However, these flames are not included in the EFSM model, because the placement and amount of these objects are expected to change frequently during development; keeping this information in the EFSM model would force the designer to constantly update the model after each change to the flames. Thus, similar to the running example, the EFSM model contains the doors, buttons and goal flag.

In addition to the EFSM model, we need to characterize a player to do PX testing (1 in Figure 2). Table 1 shows the basic characteristics of a player, defined with a set of parameters, used to configure the emotion model of the agent before the execution. The level designer determines the values of these parameters. After executing the model, we asked the designer to check the plausibility of these values by inspecting the resulting emotional heat-maps: the designer checked a randomly selected number of test cases together with their generated heat-maps to confirm that the occurrences of emotions are reasonable. Thus, the values utilized in the following experiment were confirmed reasonable by the designer. Moreover, the likelihood of reaching the goal gf0 is set to 0.5 in the initial state, to model a player who initially feels unbiased towards the prospect of finishing the level. Thus, the agent feels an equal level w of hope and fear at the beginning.

### 5.2 PX Testing Evaluation

Test suites are generated from the EFSM model using LTL model checking (MC) and the search-based (SB) approach, both with the full transition-goal coverage criterion (Def. 1) and a 60-second time budget; we name the resulting suites TS_MC and TS_SB.

Abstract test suite characteristics. Our reason for using multiple test generation algorithms is to improve the diversity of the generated test cases, which in turn should improve our ability to reveal more emotion patterns. Table 2 shows the basic characteristics of the generated test suites. Due to its stochastic behavior, the search-based (SB) generation is repeated 10 times and the results are averaged. The SB algorithm achieves full transition-goal coverage with, on average, 54.6 test cases (σ = 7.8) and an average diversity of 0.192 (σ = 0.03) between test cases in a test suite. The model checker (MC) always satisfies the criterion with 74 test cases and an average diversity of 0.113. The higher diversity of the SB test suites (TS_SB) can be explained by the stochastic nature of the search algorithm. Table 2 also shows the length of the shortest and longest test cases. While SB manages to find a shorter test case with only 17.7 transitions

Table 1: Configuration of the player characterization. G is the agent's goal set; for this level it contains one goal, reaching the goal-flag gf0. s_0 is the emotion model's initial state. A set of relevant events (E) needs to be defined by the designers: the DoorOpen event, triggered when a new door opens, is perceived as increasing the likelihood of reaching gf0 by v_1 in the model; the Ouch event, which signals a fire burn, is perceived as decreasing the likelihood of reaching gf0 by v_2; the GoalInSight event, triggered the first time the agent observes the goal gf0 in its vicinity, is modelled as making the agent believe that reaching the goal has become certain (likelihood 1); and finally the GoalAccomplished event is triggered when the goal gf0 is accomplished. Des reflects the desirability/undesirability of each event with respect to the goal, and Thres holds the emotions' activation thresholds. x, v_i, and y_i are constants determined by the designer.


on average, its longest test case has on average 74.25 transitions. Finally, the last row in Table 2 indicates the difference between the SB and MC test suites. The distance between the two test suites is measured for every generated TS_SB using Equation 5, yielding on average a distance of 0.214 (σ = 0.024) between the test cases of the two suites. Later, we investigate whether such a difference leads to differences in emotion patterns at the execution level.

Table 2: Characteristics of LTL-model checking-based and search-based test suites with respect to the same coverage criterion.


Evaluation of emotional heat-maps. Inspecting the emerging emotions requires actually executing the test cases on the game under test. Executing TS_MC with its 74 test cases and TS_SB with on average 54.5 test cases took 11,894 and 10,201 seconds, respectively, in the game. After the executions, the automated PX testing framework produces a heat-map of emotions for every test case, giving spatial information about the intensity of each emotion at each location in the game, unlike [21], which only produces heat-maps of emotions for a single pre-defined navigation path. Figure 4 shows the aggregated heat-map visualization of some selected emotions, evoked during the execution of all test cases in TS_MC and of one TS_SB suite randomly chosen from the 10 previously generated TS_SB suites, with square size u = 1 and max as the aggregation function. The maps thus show the maximum intensity at a given spot over the whole execution of the corresponding test suite. Brighter colors indicate higher intensity of an emotion; in this case, bright yellow represents the highest

emotional intensity in the heat maps. The heat maps of hope, joy and satisfaction for these test suites show quite similar spatial information (only hope and joy are shown in Figure 4). However, TS_MC generally shows a higher level of hope during game-play (Figures 4a and 4b). So, if the designer checks the level for the presence and spatial distribution of intensified hope throughout the level, the test cases produced by TS_MC expose these attributes better. This can be explained by the model checker being set up to find the shortest test cases; some of these open the next door sooner, raising hope before its intensity decays too much.
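The aggregation step described above can be sketched as follows. This is an illustrative sketch under our own assumptions about the data layout (per-test-case lists of position/intensity records), not the framework's actual implementation:

```python
# A sketch of heat-map aggregation: per-location intensities from each test
# case are combined over the whole suite with max as the aggregation function,
# on a grid of squares of size u.
from collections import defaultdict

def aggregate_heatmap(traces, emotion, u=1.0):
    """traces: one list per test case of (x, y, {emotion: intensity}) records."""
    heat = defaultdict(float)
    for trace in traces:
        for x, y, intensities in trace:
            cell = (int(x // u), int(y // u))   # square of size u
            heat[cell] = max(heat[cell], intensities.get(emotion, 0.0))
    return dict(heat)

trace_a = [(0.2, 0.3, {"hope": 0.4}), (1.5, 0.1, {"hope": 0.9})]
trace_b = [(0.4, 0.6, {"hope": 0.7})]
hm = aggregate_heatmap([trace_a, trace_b], "hope")
# hm[(0, 0)] == 0.7 (max over both test cases), hm[(1, 0)] == 0.9
```

Other aggregation functions (e.g. mean) could be swapped in to show average rather than peak intensity per spot.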

The maps also show a difference in the spatial coverage of TS_SB and TS_MC (marked green in Figures 4a and 4b). The transition that traverses the corridor is present in TS_MC, but when the corresponding abstract test case is transformed into an executable test case for the APlib test agent, the transformation also incorporates optimization. The agent therefore finds a more optimized execution by skipping the transition that actually passes through the corridor towards the room, if the next transition traverses back along the same corridor. The corridor is, however, covered by TS_SB.

Fig. 4: Heat-map visualization of positive emotions for SBT and MC test suites.

The most striking differences between TS_SB and TS_MC are revealed in their negative-emotion heat-maps (Figure 5). Most places that are marked black, i.e. distress-free, by the executed TS_MC (Figure 5a) are actually highly distressful positions for some test cases of TS_SB. The presence of distress might be the intended player experience, whereas its absence in certain places might actually be undesirable. Upon closer inspection of individual test cases, it turns out that the test cases of TS_SB that pass e.g. the red regions in Figures 5a and 5b always show distress in the marked corridor, whereas one test case in TS_MC manages to find a 'sneak route' that passes the corridor without distress and finishes the level successfully. Thus, if the designer is looking for the possible absence of distress in the sneak corridor, inspecting TS_SB alone would not suffice. The heat-maps of disappointment reveal another difference: while TS_MC only finds one location where the agent dies and feels disappointed, TS_SB manages to find three more locations in the level where disappointment occurs (Figure 5c).

The main reason behind these differences is that a sequence of transitions, not just a single transition, causes the agent to experience an emotion. Furthermore, emotion intensity has a residual behavior: a sequence of transitions may trigger an emotion that remains in the agent's emotional state for some time afterwards. Thus, satisfying the state-coverage or transition-coverage criterion does not in itself suffice to reveal the possible emotions and their patterns. The variation of transitions and their order in a test case resembles the different player behaviors during game-play, whose outcomes ultimately form the player's emotional experience. Therefore, finding a test suite that captures the distributions of these emotions, with test cases exhibiting the presence or absence of emotions in various locations, is challenging. As remarked before, due to the stochastic nature of its algorithm, the search-based approach produces more diverse test suites than the LTL model checker, and hence increases the chance of revealing more variation of emotions in different locations of the level. However, our experiments show that the model checker does provide useful complementary test cases, e.g. corner cases that were missed by SB and covered only by the model checker. All the mentioned differences together can explain the roughly 0.2 average distance between TS_MC and TS_SB.
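The residual behavior mentioned above can be illustrated with a toy decay model. This is our own hedged illustration, not the paper's emotion model: intensity added by a stimulating transition persists and decays gradually over subsequent steps.

```python
# Toy model (our assumption, not the paper's): an emotion's intensity decays
# geometrically after each stimulation instead of vanishing immediately.
def intensity_over_time(stimulations, decay=0.9, horizon=8):
    """stimulations maps time step -> added intensity; returns intensity per step."""
    levels, current = [], 0.0
    for t in range(horizon):
        current = current * decay + stimulations.get(t, 0.0)
        levels.append(round(current, 4))
    return levels

levels = intensity_over_time({0: 1.0, 5: 0.5})
# the emotion is still present several steps after each stimulating transition
```

Under such a model, two test cases covering the same transitions in a different order can yield different intensity profiles, which is why transition coverage alone does not determine the emotional outcome.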

Fig. 5: Heat-map visualization of negative emotions for SBT and MC test suites.

Checking emotion pattern requirements. The PX testing framework is also capable of verifying emotion requirements expressed as patterns in the format of Definition 2, based on the stimulation of emotions. These patterns are verified by inspecting the order in which different emotions are stimulated, as recorded in the trace files. Although there are numerous combinations of emotions, only some of them matter to the designer. Recall that a pattern can be posed as an existential requirement, i.e. Sat(p), required to hold for all game-plays, i.e. Valid(p), or required to be unwitnessed in all game-plays, i.e. UnSat(p). It is also essential to note that the choice of which emotion patterns to require can vary between game levels, as expectations on the occurrence of patterns depend on the design goal. E.g., a game level with Sat(DHS) would provide at least one thrilling game-play, but if the level is intended to be easy for beginners, the designer might insist on UnSat(DHS) instead. We have collected a number of emotion pattern requirements from the designer of the Wave-the-flag level; these are shown in the upper part of Table 3. The main expectation of the designer is to ensure that the designed level is enjoyable, i.e. that different positive as well as negative emotions are experienced during game-play, and that the player does not get bored, interpreting boredom as the absence of active emotions in the agent's emotional state for some time. As can be seen in Table 3, while most requirements are verified during the test, some requirements, such as Sat(J¬S), fail. This requirement indicates that the designer expects at least one execution path in which joy is stimulated

at least once throughout the execution, but the agent never reaches the goal with satisfaction. Sat patterns that fail to be witnessed, or UnSat patterns that are witnessed, assist the designer in altering the game level and running the agent through it again. For example, here, the failure of Sat(J¬S) indicates that the designer needs to put some challenging objects, like fire or enemies, in the vicinity of the goal gf0.
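The Sat/Valid/UnSat checks over recorded stimulation orders can be sketched as below. This is a minimal illustration for negation-free patterns only (the names and exact semantics are our assumptions; Definition 2 in the paper is the authoritative pattern format):

```python
# Sat/Valid/UnSat over emotion-stimulation traces, with a pattern interpreted
# as an ordered (not necessarily contiguous) subsequence of stimulations.
def matches(pattern, trace):
    """True if `pattern` occurs as an ordered subsequence of `trace`."""
    it = iter(trace)
    return all(e in it for e in pattern)

def sat(pattern, traces):      # witnessed by at least one game-play
    return any(matches(pattern, t) for t in traces)

def valid(pattern, traces):    # witnessed by every game-play
    return all(matches(pattern, t) for t in traces)

def unsat(pattern, traces):    # witnessed by no game-play
    return not sat(pattern, traces)

traces = [["H", "D", "H", "S"], ["H", "F", "D"]]
sat(["D", "H", "S"], traces)   # True: witnessed by the first game-play
valid(["H"], traces)           # True: hope is stimulated in every game-play
unsat(["F", "S"], traces)      # True: fear-then-satisfaction never occurs
```

Patterns with absence markers such as ¬S would additionally require checking that the negated emotion is never stimulated in the relevant span, which this sketch omits.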

Table 3: Emotion pattern check with TS_MC and TS_SB. H = hope, F = fear, J = joy, D = distress, S = satisfaction, P = disappointment, and ¬X = absence of emotion X.


Table 3 also shows a similar detection ratio over the various Sat(p) requirements, for patterns p of length 2–5, by TS_SB and TS_MC, indicating that both test suites perform well at detecting Sat-type emotion patterns. However, the last three patterns in Table 3 are covered by TS_SB but missed by TS_MC. The two suites are thus complementary, which makes it reasonable to use both for verifying emotion pattern requirements.

### 5.3 Mutation Testing Evaluation

Mutation testing [32] is a technique to evaluate the quality of test suites in detecting faults, represented by faulty variants ('mutants') of the target program generated through a set of mutation operators. Here, we use it to evaluate the strength of our PX testing approach. In the procedure, we use a corrected Wave-the-flag level (the 'original' level), satisfying all the emotion pattern requirements posed in Table 3. Mutations are applied to the original's level definition file to produce mutants (one mutation per mutant). An example of a mutation is to remove all fire flames from a certain zone in the level; Table 4 lists the used mutation operators. A mutant represents an alternate design of the level: it maintains the level's logic but may induce a different PX. To apply the mutations, the game level is divided into 16 zones of about equal size, and the mutation operators are applied to each zone. Every mutant is labeled with the applied mutation operator and z_x_y, where (x, y) specifies the bottom-left corner of the zone on which the mutation is applied. After dropping mutations that do not change the level's properties, we obtain 20 distinct mutants, from which we randomly choose 10 for execution. We re-run both the TS_MC and TS_SB test suites on each mutant. A mutant is automatically killed when the verdict of at least one emotion pattern requirement differs from the verdict on the original level. Table 5 shows that 8 of the 10 randomly selected mutants are killed. The remaining mutants are not killed, presumably because the posed emotion requirements are not distinctive enough to differentiate them.

Table 4: Mutation operators

| Code | Description |
|-------|----------------------------------|
| RF | Remove fire |
| RW2WF | Relocate fire between walls |
| RMRF | Relocate fire in middle of a room |
| AMRF | Add fire in middle of a room |
| AW2WF | Add fire between walls |
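The zone division and mutant labeling can be sketched as follows (names, zone count per side, and level dimensions are our illustrative assumptions):

```python
# Divide the level into n x n zones of about equal size and label mutants
# with their operator and the zone's bottom-left corner.
def zones(level_width, level_height, n=4):
    """Return the bottom-left corner (x, y) of each of the n*n zones."""
    w, h = level_width // n, level_height // n
    return [(i * w, j * h) for i in range(n) for j in range(n)]

def mutant_label(operator, corner):
    """E.g. operator 'RF' applied at corner (0, 0) yields 'RF_z_0_0'."""
    x, y = corner
    return f"{operator}_z_{x}_{y}"

corners = zones(40, 40)                 # 16 zones for a hypothetical 40 x 40 level
label = mutant_label("RF", corners[0])  # "RF_z_0_0": remove fire in zone (0, 0)
```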

Threats to Validity. The designed character in the player characterization, the selected coverage criterion for test generation to verify UnSat specifications, and the small number of mutation testing assessments due to the computational cost are internal threats to the validity of this work. In terms of external threats, results from an experiment on a single level cannot safely be generalized.

### 6 Related Work

A considerable amount of research has been conducted on automated play testing to reduce the cost of repetitive and labor-intensive functional testing tasks in video games [35, 54]. In particular, agent-based testing has been a subject of recent research, letting agents play and explore the game space on behalf of human players for testing purposes. Ariyurek et al. [7] introduce Reinforcement Learning (RL) and Monte Carlo Tree Search (MCTS) agents to detect bugs in video games automatically. Stahlke et al. [51] present the basis for a framework that models a player's memory and goal-oriented decision-making to simulate human navigational behavior for identifying level design issues. The framework creates an AI agent that uses a path-finding heuristic to navigate a level, optimized by given player characteristics such as level of experience and play-style. Zhao et al. [55] aim to create agents with human-like behavior for balancing games based on skill and play-styles; these parameters are measured using introduced metrics that help train the agents in four different case studies to test game balance and to imitate players with different play-styles. Gordillo et al. [24] address the game-state coverage problem in play-testing by introducing a curiosity-driven reinforcement learning agent for a 3D game. The test agent uses proximal policy optimization (PPO) with a curiosity factor reflected in the RL reward function via the visit frequency of game states. Pushing the agent towards exploratory behaviour improves the chance of reaching unseen states and exposing bugs.

Among game model-based testing approaches, Iftikhar et al. [30] apply model-based testing to the Mario Brothers game for functional testing. The study uses a UML state machine as the game model for test case generation, which manages to reveal faults. Ferdous et al. [18] employ combined search-based and model-based testing for automated play-testing using an EFSM; search algorithms are compared with respect to model coverage and bug detection. Note that while an EFSM provides paths through a game, it cannot reveal the experience of a player who navigates a path.

Despite some research on modeling human players and their behavior in agents for automated functional play testing, there is little research on the automation of PX evaluation. Holmgard et al. [28] propose to create procedural personas, or player characterizations, for test agents to help game designers develop game content and desirable level designs for different players. The research proposes to create personas in test agents using MCTS with evolutionary computation for node selection. The results on the MiniDungeons 2 game show how different personas bring about different behavior in response to game content, which can be seen as different play-styles. Lee et al. [34] investigate a data-driven cognitive model of human performance in moving-target acquisition to estimate game difficulty for players with different skill levels. There is limited research on emotion prediction and its use for the automation of PX evaluation. Gholizadeh et al. [21] introduce an emotional agent using a formal model of OCC emotions and propose the potential use of such an agent for PX assessment. However, the approach lacks automated path planning and reasoning, and hence cannot perform automated game-play; automatic coverage of game states and collection of all emerging emotions are thus not supported, which is addressed in this paper.

# 7 Conclusion & Future work

This paper presented a framework for automated player experience testing, in particular automated verification of emotion requirements, using a computational model of emotions and model-based test generation targeting a subset of human players' behaviors. We presented a language for emotion patterns to capture emotion requirements. We also investigated the complementary impact of different test generation techniques on verifying spatial and temporal emotion patterns.

Future work. The presented language can capture complex patterns over the temporal order of emotion stimulations in the framework. However, it cannot capture the spatial behavior of emotions, such as differences in the heat-maps. Generally, combining spatial and temporal aspects to verify emotion requirements in specific areas and time intervals would give a more refined way to assess the emotional experience; how to capture this in formal patterns is still an open question. Investigating the application of our approach in empirical case studies with human players is also future work.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Opportunistic Monitoring of Multithreaded Programs

Chukri Soueidi, Antoine El-Hokayem, and Yliès Falcone

Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France

{chukri.soueidi,antoine.el-hokayem, ylies.falcone}@univ-grenoble-alpes.fr

Abstract. We introduce a generic approach for monitoring multithreaded programs online, leveraging existing runtime verification (RV) techniques. In our setting, monitors are deployed to monitor specific threads and only exchange information upon reaching synchronization regions defined by the program itself. They use the opportunity of a lock in the program to evaluate information across threads. As such, we refer to this approach as opportunistic monitoring. By using the existing synchronization, our approach reduces the additional overhead and interference of synchronization, at the cost of a delay in determining the verdict. We use the textbook readers-writers example to show how opportunistic monitoring is capable of expressing specifications on concurrent regions. We also present a preliminary assessment of the overhead of our approach and compare it to classical monitoring, showing that it scales particularly well with the concurrency present in the program.

# 1 Introduction

Guaranteeing the correctness of concurrent programs often relies on dynamic analysis and verification approaches. Some approaches target generic concurrency errors such as data races [29, 37], deadlocks [11], and atomicity violations [28, 47, 57]. Others target behavioral properties such as null-pointer dereferences [27], and typestate violations [36, 38, 55] and more generally order violations with runtime verification [42]. In this paper, we focus on the runtime monitoring of general *behavioral* properties targeting violations that cannot be traced back to classical concurrency errors.

Runtime verification (RV) [9, 24, 25, 34, 42], also known as runtime monitoring, is a lightweight formal method that allows checking whether a run of a system respects a specification. The specification formalizes a behavioral property and is written in a suitable formalism based for instance on temporal logic such as LTL or finite-state machines [1, 45]. Monitors are synthesized from the specifications, and the program is instrumented with additional code to extract events from the execution. These extracted events generate the trace, which is fed to the monitors. From the monitor perspective, the program is a black box and the trace is the sole system information provided.

To model the execution of a concurrent program, verification techniques choose their trace collection approaches differently based on the class of targeted properties. When properties require reasoning about concurrency in the program, causality must be established during trace collection to determine the *happens-before* [40] relation between events. Data race detection techniques [29, 37] for instance require the causal ordering to check for concurrent accesses to shared variables; as well as predictive approaches targeting behavioral properties such as [19, 38, 55] in order to explore other feasible executions. Causality is best expressed as a partial order over events. Partial orders are compatible with various formalisms for the behavior of concurrent programs such as weak memory consistency models [2, 4, 46], Mazurkiewicz traces [32, 48], parallel series [43], Message Sequence Charts graphs [49], and Petri Nets [50]. However, while the program behaves non-sequentially, its observation and trace collection is sequential. Collecting partial order traces often relies on vector clock algorithms to timestamp events [3,16,47,53] and requires blocking the execution to collect synchronization actions such as locks, unlocks, reads, and writes. Hence, existing techniques that can reason on concurrent events are expensive to use in an online monitoring setup. Indeed, many of them are often intended for the design phase of the program and not in production environments (see Section 5).
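As background for the vector-clock timestamping mentioned above, the core mechanism and the induced happens-before check can be sketched as follows (a minimal illustration, not any specific cited algorithm):

```python
# Vector clocks: each thread keeps a vector of counters; local events advance
# its own entry, and synchronization merges clocks component-wise.
class VectorClock:
    def __init__(self, n_threads, tid):
        self.clock = [0] * n_threads
        self.tid = tid

    def local_event(self):
        self.clock[self.tid] += 1
        return tuple(self.clock)

    def sync_with(self, other):
        """Merge on synchronization, e.g. a lock acquire observing the
        timestamp of the latest release."""
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.clock[self.tid] += 1
        return tuple(self.clock)

def happens_before(u, v):
    """u -> v iff u <= v component-wise and u != v; otherwise concurrent."""
    return all(a <= b for a, b in zip(u, v)) and u != v

t0, t1 = VectorClock(2, 0), VectorClock(2, 1)
e1 = t0.local_event()     # (1, 0)
e2 = t1.sync_with(e1)     # (1, 1): t1 acquires the lock t0 released
e3 = t0.local_event()     # (2, 0)
# happens_before(e1, e2) holds; e2 and e3 are concurrent
```

The cost visible even in this sketch, timestamping every relevant event and intercepting every synchronization action, is why such techniques are expensive for online monitoring.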

Other monitoring techniques, relying on total-order formalisms such as LTL and finite-state machines, require linear traces to be fed to the monitors. As such, they immediately capture linear traces from a concurrent execution without re-establishing causality. Most of the top<sup>1</sup> existing tools for the online monitoring of Java programs, such as Java-MOP [18, 30] and Tracematches [5], provide multithreaded monitoring support using one or both of the following *two* modes. The *per-thread* mode specifies that monitors are only associated with a given thread and receive all events of that thread. This boils down to classical RV of single-threaded programs, treating each thread as an independent program; in this case, monitors are unable to check properties that involve events across threads. The *global* monitoring mode spawns a global monitor and ensures that the events from different threads are fed to a central monitor atomically, by utilizing locks, to avoid data races. As such, the monitored program execution is *linearized* so that it can be processed by the monitors. In addition to introducing additional synchronization between threads, inhibiting parallelism, this monitoring mode forces events of interest to be totally ordered across the entire execution, which oversimplifies and ignores concurrency.

Figure 1 illustrates a high-level view of a concurrent execution fragment of *1-Writer 2-Readers*, where a writer thread writes to a shared variable, and two other reader threads read from it. The reader threads share the same lock and can read concurrently once one of them acquires it, but no thread can write nor read while a write is occurring. We only depict the read/write events and omit lock acquires and releases for brevity. In this execution, the writer acquires the lock first and writes (event 1), then after one of the reader threads acquires the lock, they both concurrently read. The first reader performs 3 reads (events 2, 4, and 5), while the second reader performs 2 reads (events 3 and 6), after that the writer acquires the lock and writes again (event 7). A user

Fig. 1: Execution fragment of *1-Writer 2-Readers*. Double circle: write, normal: read. Numbers distinguish events. Events 2 and 6 (shaded) are example concurrent events.

<sup>1</sup> Based on the first three editions of the Competition on Runtime Verification [7, 8, 26, 52].

may be interested in the following behavioral property: *"Whenever a writer performs a write, all readers must perform at least one read before the next write"*. Note that this execution has no data races nor deadlocks, so techniques focusing on generic concurrency properties are not suitable for the property. Monitoring this (partial) concurrent execution with either of the previously mentioned modes presents restrictions. *Per-thread* monitoring, since each reader and the writer is a separate thread, cannot check any specification that refers to an interaction between them. *Global* monitoring imposes an additional lock operation to send each read event to the monitor, introducing additional synchronization and suppressing the concurrency of the program.
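For concreteness, the property above can be checked by a simple global monitor over an already-linearized trace. This is an illustrative sketch (event encoding and function names are our assumptions), showing what global monitoring computes once every event has been funneled to one monitor:

```python
# Global monitor for: between two consecutive writes, every reader must
# perform at least one read. Events are (kind, thread_id) pairs.
def monitor(trace, n_readers):
    readers_seen = None                    # None until the first write
    for kind, tid in trace:
        if kind == "write":
            if readers_seen is not None and len(readers_seen) < n_readers:
                return False               # some reader missed the last write
            readers_seen = set()
        elif kind == "read" and readers_seen is not None:
            readers_seen.add(tid)
    return True

ok = monitor([("write", 0), ("read", 1), ("read", 2), ("write", 0)], 2)
bad = monitor([("write", 0), ("read", 1), ("write", 0)], 2)
# ok is True; bad is False (reader 2 never read between the two writes)
```

In a real global-monitoring deployment, each of these events would be delivered under an extra lock, which is exactly the overhead the opportunistic approach avoids.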

A central observation we made is that when the program is free from generic concurrency errors such as data races and atomicity violations, a monitoring approach can be opportunistic and utilize the available synchronization in the program to reason about high-level behavioral properties. In the previous example, we know that reads and writes are guarded by a lock and do not execute concurrently (assuming we checked for data races). We also know that the relative ordering of the reads between themselves is not important to the property as we are only interested in counting that they all read the latest write. As such, instead of blocking the execution at each of the 7 events to safely invoke a global monitor and check for the property, we can have thread-local observations and only invoke the global monitor once either one of the readers acquires the lock or when the writer acquires it (only 3 events). As such, in this paper, we propose an approach to opportunistic runtime verification. We aim to (i) provide an approach that enables users to arbitrarily reason about concurrency fragments in the program, (ii) be able to monitor properties *online* without the need to record the execution, (iii) utilize the existing tools and formalism prevalent in the RV community, and (iv) do so efficiently without imposing additional synchronization.

We see our contributions as follows. We present a generic approach to monitoring lock-based multithreaded programs that enables the reuse of existing tools and approaches by bridging *per-thread* and *global* monitoring. Our approach consists of a two-level monitoring technique where existing tools can be employed at both levels. At the first level, a thread-local specification checks a given property on the thread itself, where events are totally ordered. At the second level, we define *scopes*, which delimit concurrency regions. Scopes rely on operations in the program guaranteed to follow a total order. The guarantee is ensured by the platform itself: the program model, the execution engine (the JVM in our case), or the compiler. We assume that scopes execute atomically at runtime. Upon reaching the totally ordered operations, a scope monitor uses the results of all thread-local monitors executed in the concurrent region to construct a scope state, and performs monitoring on the sequence of such states. Our approach can be seen as a combination of global monitoring at the level of scopes (for our example, we use lock acquires) and per-thread monitoring for the active threads in a scope. Thus, we allow per-thread monitors to communicate their results when the program synchronizes. This approach relies on existing ordered operations in the program; it therefore incurs minimal interference and overhead, as it does not add additional synchronization, namely locks, between threads in order to collect a trace.
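The two-level scheme can be sketched on the readers-writers example as follows. Class and method names are hypothetical and the sketch deliberately omits the instrumentation machinery: each thread updates a local monitor without locking, and the scope monitor combines the local results only at a synchronization point the program already performs (the lock acquire preceding a write):

```python
# Thread-local monitors plus a scope monitor invoked opportunistically under
# the program's own lock, instead of locking on every event.
class ReaderLocal:
    """Thread-local monitor: counts this reader's reads; no extra locking."""
    def __init__(self):
        self.reads = 0

    def on_read(self):
        self.reads += 1

class ScopeMonitor:
    """Combines thread-local results at the scope boundary (a write)."""
    def __init__(self, reader_locals):
        self.reader_locals = reader_locals
        self.seen_write = False
        self.violated = False

    def on_write(self):
        # A write closes the previous concurrent region: every reader must
        # have read at least once since the last write.
        if self.seen_write and any(r.reads == 0 for r in self.reader_locals):
            self.violated = True
        self.seen_write = True
        for r in self.reader_locals:
            r.reads = 0

r1, r2 = ReaderLocal(), ReaderLocal()
scope = ScopeMonitor([r1, r2])
scope.on_write()              # first write: nothing to check yet
r1.on_read(); r2.on_read()
scope.on_write()              # both readers read since the last write: fine
scope.on_write()              # no reads in between: violation recorded
```

Compared with global monitoring, the scope monitor here runs only at the three writes rather than at all seven events, mirroring the reduction discussed in Section 1.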

Fig. 2: Concurrent execution fragment of 1-Writer 2-Readers. Labels l, u, w,r indicate respectively: lock, unlock, write, read. Actions with a double border indicate actions of locks. The read and write actions are filled to highlight them.

# 2 Modeling the Program Execution

We are concerned with an abstraction of a concurrent execution; we focus on a model that is useful for monitoring behavioral properties. We choose the smallest observable execution step performed by a program and refer to it as an *action*, for instance a method call or a write operation.

Definition 1 (Action). *An action is a tuple* ⟨lbl, id, ctx⟩*, where:* lbl *is a label,* id *is a unique identifier, and* ctx *is the context of the action.*

The label captures an instruction name, function name, or specific task information depending on the granularity of actions. Since the action is a runtime object, we use id to distinguish two executions of the same syntactic element. Finally, the context (ctx) is a set containing dynamic contexts such as a thread identifier (threadid), process identifier (pid), resource identifier (resid), or a memory address. We use the notation id.lbl^threadid_resid to denote an action, omitting resid when absent, and id when there is no ambiguity. Furthermore, we use the notation a.threadid for a given action a to retrieve the thread identifier in the context, and a.ctx(key) to retrieve any element in the context associated with key.

Definition 2 (Concurrent Execution). *A concurrent execution is a partially ordered set of actions, that is, a pair* ⟨A, →⟩*, where* A *is a set of actions and* → ⊆ A × A *is a partial order over* A*.*

Two actions a1 and a2 are related (i.e., ⟨a1, a2⟩ ∈ →) if a1 happens before a2.
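Definitions 1 and 2 can be rendered directly as a small sketch (names and data layout are illustrative, not part of the paper's tooling):

```python
# An action <lbl, id, ctx> and a concurrent execution <A, ->> as plain data.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    lbl: str        # instruction, function, or task label
    id: int         # unique per runtime occurrence of the syntactic element
    ctx: tuple = () # context pairs, e.g. (("threadid", 1), ("resid", "s"))

    @property
    def threadid(self):
        return dict(self.ctx).get("threadid")

w = Action("write", 1, (("threadid", 0),))
r = Action("read", 2, (("threadid", 1),))
# A concurrent execution: the set of actions and the happens-before pairs.
execution = ({w, r}, {(w, r)})
```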

*Example 1 (Concurrent fragment for 1-Writer 2-Readers.).* Figure 2 shows another concurrent execution fragment for *1-Writer 2-Readers* introduced in Sec. 1. The concurrent execution fragment contains all actions performed by all threads, along with the partial order inferred from the synchronization actions such as locks and unlocks (depicted with dashed boxes). Recall that a lock action on a resource synchronizes with the latest unlock if it exists. This synchronization is depicted by the dashed arrows. We have three locks: test for readers (t), service (s), and readers counter (c). Lock t checks if any reader is currently reading, and this lock gives preference to writers. Lock s is used to regulate access to the shared resource, it can be either obtained by readers or one writer. Lock c is used to regulate access to the readers counter, it only synchronizes readers. In this concurrent execution, first, the writer thread acquires the lock and writes on a shared variable whose resource identifier is omitted for brevity. Second, the readers acquire the lock s and perform a read on the same variable. Third, the writer performs a second write on the variable.

In RV, we often do not capture the entire concurrent execution but are interested in gathering a *trace* of the relevant parts of it. In our approach, a trace is also a concurrent execution defined over a subset of actions. Since the trace is the input to any RV technique, we are interested in relating a trace to the concurrent execution, while focusing on a subset of actions. For this purpose, we introduce the notions of *soundness* and *faithfulness*. We first define the notion of *trace soundness*. Informally, a concurrent execution is a sound trace if it does not provide false information about the execution.

Definition 3 (Trace Soundness). *A concurrent trace* tr = ⟨A_tr, →_tr⟩ *is said to be a sound trace of a concurrent execution* e = ⟨A, →⟩ *(written* snd(e, tr)*) iff (i)* A_tr ⊆ A *and (ii)* →_tr ⊆ →*.*

Intuitively, to be sound, a trace (i) should not capture an action not found in the execution, and (ii) should not relate actions that are unrelated in the execution. While a sound trace provides no incorrect information on the order, it can still be missing information about the order. In this case, we want to also express the ability of a trace to capture all relevant order information. Informally, a *faithful trace* contains all information on the order of events that occurred in the program execution.

Definition 4 (Trace Faithfulness). *A concurrent trace* tr = ⟨A_tr, →_tr⟩ *is said to be faithful to a concurrent execution* e = ⟨A, →⟩ *(written* faith(e, tr)*) iff* →_tr ⊇ (→ ∩ (A_tr × A_tr))*.*
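Both predicates can be checked directly on explicit set representations of executions and traces. The following is a minimal Python sketch; the pair encoding of executions as (actions, happens-before pairs) is our own assumption, not notation from the paper.

```python
# A concurrent execution/trace is modeled as a pair (actions, order),
# where `order` is a set of (a, b) pairs meaning "a happens before b".

def snd(execution, trace):
    """Definition 3: the trace mentions no action and no ordering
    that is absent from the execution."""
    acts_e, order_e = execution
    acts_t, order_t = trace
    return acts_t <= acts_e and order_t <= order_e

def faith(execution, trace):
    """Definition 4: the trace keeps every ordering of the execution
    that involves only actions the trace observes."""
    acts_e, order_e = execution
    acts_t, order_t = trace
    restricted = {(a, b) for (a, b) in order_e if a in acts_t and b in acts_t}
    return restricted <= order_t

# Toy execution: w1 -> r1 and w1 -> r2 (two concurrent reads after a write).
e = ({"w1", "r1", "r2"}, {("w1", "r1"), ("w1", "r2")})
t_sound = ({"w1", "r1"}, {("w1", "r1")})  # sound and faithful
t_lossy = ({"w1", "r1"}, set())           # sound but not faithful
print(snd(e, t_sound), faith(e, t_sound))  # True True
print(snd(e, t_lossy), faith(e, t_lossy))  # True False
```

The lossy trace is still sound: it invents nothing, but it drops the w1 → r1 ordering and is therefore not faithful.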

# 3 Opportunistic Monitoring

We start by distinguishing threads and events from the execution. We then define scopes, which allow us to reason about properties over concurrent regions. Finally, we devise a generic approach to evaluate scope properties and perform monitoring.

### 3.1 Managing Dynamic Threads and Events

Threads are typically created at runtime and have a unique identifier. We denote the set of all thread ids by TID. Thread identifiers are subject to change from one execution to another, and it is not known in advance how many threads will be spawned during an execution. As such, it is important to design specifications that can handle threads dynamically.

Distinguishing Threads To allow for a dynamic number of threads, we first introduce thread types T to distinguish the threads that are relevant to the specification. For example, the set of thread types for *readers-writers* is T_rw = {reader, writer}. By using thread types, we can define properties for specific types regardless of the number of threads spawned for a given type. To assign a type to a thread in practice, we distinguish a set of actions S ⊆ A called "spawn" actions. For example, in *readers-writers*, we can assign the spawn action of a reader (resp. writer) to be the method invocation of Reader.run (resp. Writer.run). Function spawn : S → T assigns a thread type to a spawn action. The threads that match a given type are determined by the spawn action(s) present during the execution. We note that a thread can have multiple types. To reference all threads assigned a given type, we use function pool : T → 2^TID. That is, given a type t and a thread with *threadid* tid, we have tid ∈ pool(t) iff ∃a ∈ S : spawn(a) = t ∧ a.threadid = tid. Allowing a thread to have multiple types lets properties operate on different events in the same thread.
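A possible encoding of spawn and pool is sketched below in Python; the representation of spawn actions as (thread id, method label) pairs is our own assumption.

```python
from collections import defaultdict

# spawn : S -> T, here keyed on the invoked run method (assumed encoding).
SPAWN_TYPES = {"Reader.run": "reader", "Writer.run": "writer"}

def build_pools(spawn_actions):
    """pool : T -> 2^TID, derived from the spawn actions observed at runtime."""
    pools = defaultdict(set)
    for tid, label in spawn_actions:
        if label in SPAWN_TYPES:
            pools[SPAWN_TYPES[label]].add(tid)
    return pools

# One writer (tid 0) and two readers (tids 1 and 2) are spawned:
pools = build_pools([(0, "Writer.run"), (1, "Reader.run"), (2, "Reader.run")])
print(pools["reader"])  # {1, 2}
print(pools["writer"])  # {0}
```

Because pools are derived from the observed spawn actions, the specification needs no change when extra readers or writers are added.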

Events As properties are defined over events, actions are typically abstracted into events. As such, we define for each thread type t ∈ T the alphabet of events E_t. Set E_t contains all the events that can be generated from actions for the particular thread type t ∈ T. The empty event 𝓔 is a special event indicating that no event is matched. Then, we assume a total function ev_t : A → {𝓔} ∪ E_t. The implementation of ev depends on the specification formalism used; it is capable of generating events based on the context of the action itself. For example, the conversion can use the runtime context of actions to generate parametric events when needed. We illustrate in Ex. 2 a function ev that matches using the label of an action.

*Example 2 (Events.).* We identify for *readers-writers* (Ex. 1) two thread types: T_rw ≝ {reader, writer}. We are interested in the events E_reader ≝ {read} and E_writer ≝ {write}. For a specification at the level of a given thread, we have either a reader or a writer, and the event associated with the reader (resp. writer) is read (resp. write).

$$\text{ev}\_{\text{reader}}(\mathbf{a}) \stackrel{\text{def}}{=} \begin{cases} \text{read} & \text{if } \text{a.lbl} = \text{"r"},\\ \mathcal{E} & \text{otherwise} \end{cases} \qquad \text{ev}\_{\text{writer}}(\mathbf{a}) \stackrel{\text{def}}{=} \begin{cases} \text{write} & \text{if } \text{a.lbl} = \text{"w"},\\ \mathcal{E} & \text{otherwise}. \end{cases}$$
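These two functions transcribe directly to code. In the sketch below, actions are encoded as dicts with a label field, and None stands for the empty event — both are our own assumptions.

```python
EMPTY = None  # stands for the empty event (no match)

def ev_reader(action):
    # Map a reader action to the `read` event based on its label.
    return "read" if action["lbl"] == "r" else EMPTY

def ev_writer(action):
    # Map a writer action to the `write` event based on its label.
    return "write" if action["lbl"] == "w" else EMPTY

print(ev_reader({"lbl": "r"}), ev_writer({"lbl": "r"}))  # read None
```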

### 3.2 Scopes: Properties Over Concurrent Regions

We now define the notion of *scope*. A scope defines a projection of the concurrent execution that delimits concurrent regions and allows verification to be performed at the level of regions instead of the entire execution.

Synchronizing Actions A scope s is associated with a synchronizing predicate sync_s : A → B_2, which is used to determine *synchronizing actions* (SAs). The set of synchronizing actions for a scope s is defined as SA_s = {a ∈ A | sync_s(a) = ⊤}. SAs constitute synchronization points in a concurrent execution for multiple threads. A valid set of SAs is such that there exists a total order on all actions in the set (i.e., no two SAs can occur concurrently). As such, SAs are sequenced and can be mapped to indices. Function idx_s : SA_s → ℕ∖{0} returns the index of a synchronizing action. For convenience, we map them starting at 1, as 0 will indicate the initial state. We denote by |idx_s| the length of the sequence.

Scope Region A scope region selects the actions of the concurrent execution delimited by two successive SAs. We define two "special" synchronizing actions begin, end ∈ A, common to all scopes, that are needed to evaluate the first and last regions. These actions refer to the beginning and end of the concurrent execution, respectively.

Definition 5 (Scope Regions). *Given a scope* s *and an associated index function* idx_s : SA_s → ℕ∖{0}*, the scope regions are given by function* R_s : codom(idx_s) ∪ {0, |idx_s| + 1} → 2^A*, defined as:*

$$
\mathcal{R}\_{\mathtt{s}}(i) \stackrel{\text{def}}{=} \begin{cases}
\{a \in \mathbb{A} \mid \langle a', a \rangle \in \to \land \langle a, a'' \rangle \in \to \land \text{issync}(a', i - 1) \land \text{issync}(a'', i)\} & \text{if } 1 \le i \le |\text{idx}\_{\mathtt{s}}|, \\
\{a \in \mathbb{A} \mid \langle a', a \rangle \in \to \land \langle a, \text{end} \rangle \in \to \land \text{issync}(a', i - 1)\} & \text{if } i = |\text{idx}\_{\mathtt{s}}| + 1, \\
\{a \in \mathbb{A} \mid \langle \text{begin}, a \rangle \in \to \land \langle a, a'' \rangle \in \to \land \text{issync}(a'', 1)\} & \text{if } i = 0, \\
\emptyset & \text{otherwise}
\end{cases}
$$

*where:* issync(a, i) ≝ (sync_s(a) = ⊤ ∧ idx_s(a) = i)*.*

R_s(i) is the i-th scope region: the set of all actions that happened between the two synchronizing actions a and a′ with idx_s(a) = i − 1 and idx_s(a′) = i, taking into account the start and end of a program execution (i.e., actions begin and end, respectively).
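As a simplified illustration of Definition 5, the following Python sketch splits an execution at its synchronizing actions. It assumes, for simplicity, that the actions are already given as a linearization consistent with the partial order, and indexes regions from 0; the action encoding is our own.

```python
def scope_regions(actions, is_sync):
    """Split a linearized execution into scope regions: region i collects
    the actions between the i-th and (i+1)-th synchronizing actions,
    with the begin/end of the execution as implicit boundaries."""
    regions = [[]]
    for a in actions:
        if is_sync(a):
            regions.append([])   # an SA closes the current region
        else:
            regions[-1].append(a)
    return regions

# Readers-writers sketch: 'L' marks an acquire of the service lock s.
trace = ["L", "w", "u", "L", "r", "r", "L", "w"]
print(scope_regions(trace, lambda a: a == "L"))
# [[], ['w', 'u'], ['r', 'r'], ['w']]
```

Each lock acquire of s delimits a region, matching the intuition of Example 3 below: one region holds the writer's write and unlocks, the next the two concurrent reads.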

*Example 3 (Scope regions).* For *readers-writers* (Ex. 1), we consider the resource service lock (s) to be the one of interest, as it delimits the concurrent regions that allow either a writer to write or readers to read. We label the scope res for the remainder of the paper. The synchronizing predicate sync_res selects all actions with label l (lock acquire) and with the lock id s present in the context of the action. The obtained sequence of SAs is 0.l^0_s · 1.l^1_s · 2.l^0_s. The value of idx_res for each of the obtained SAs is respectively 1, 2, and 3. Every lock acquire delimits the regions of the concurrent execution. The region k + 1 includes all actions between the two lock acquires 0.l^0_s and 1.l^1_s. That is, R_res(k + 1) = {0.w^0, 0.u^0_s, 0.u^0_t, 1.l^1_t, 0.l^1_c, 0.i^1}. The region k + 2 contains two concurrent reads: r^1 and r^2.

Definition 6 (Scope fragment). *The scope fragment associated with a scope region* R_s(i) *is defined as* F_s(i) ≝ ⟨R_s(i), → ∩ (R_s(i) × R_s(i))⟩*.*

Proposition 1 (Scope fragment preserves order). *Given a scope* s*, we have:* ∀i ∈ dom(R_s) : snd(⟨A, →⟩, F_s(i)) ∧ faith(⟨A, →⟩, F_s(i))*.*

Proposition 1 states that for a given scope, any fragment (obtained using F_s) is a sound and faithful trace of the concurrent execution. This is ensured by construction: Definitions 5 and 6 follow the same principles as the definitions of soundness (Definition 3) and faithfulness (Definition 4).

*Remark 1.* In this paper, scope regions are defined by the user by selecting the synchronizing predicate as part of the specification. Given a property, regions should delimit the events whose order matters for the property. For instance, for a property specifying that *"between each write, at least one read should occur"*, the scope regions should delimit read versus write events. Delimiting the read events themselves, performed by

Fig. 3: Projected actions using the scope and local properties of 1-Writer 2-Readers. The action labels l, w, r indicate respectively: lock, write, and read. Filled actions indicate actions for which function ev for the thread type returns an event. Actions with a patterned background indicate the SAs for the scope.

different threads, is not significant. How to analyze the program in order to find and suggest to the user scopes that are suitable for monitoring a given property is an interesting challenge that we leave for future work. Moreover, we assume the program is properly synchronized and free from data races.

Local Properties In a given scope region, we determine properties that will be checked locally on each thread. A thread-local monitor checks a local property independently for each given thread. These properties can be seen as the analogue of *per-thread* monitoring applied between two SAs. For a specific thread, the local actions are guaranteed to be totally ordered. This ensures that local properties are compatible with, and can be checked by, existing RV techniques and formalisms. We refer to these properties as *local properties*.

### Definition 7 (Local property). *A local property is a tuple* ⟨type, EVS, RT, eval⟩ *with:*

- type ∈ T, the thread type to which the property applies;
- EVS ⊆ E_type, the set of events relevant to the property;
- RT, the verdict domain (return type) of the property;
- eval : EVS* → RT, the evaluation function mapping a sequence of events to a verdict.

We use the dot notation: for a given property prop = ⟨type, EVS, RT, eval⟩, we write prop.type, prop.EVS, prop.RT, and prop.eval, respectively.

*Example 4 (At least one read).* The property "at least one read", defined for the thread type reader, states that a reader must perform at least one read event. It can be expressed in classical LTL3 [10] (a variant of linear temporal logic with finite-trace semantics commonly used in RV) as φ_1r ≝ F(read) over the set of atomic propositions {read}. Let LTL3^AP_φ denote the evaluation function of LTL3 for a set of atomic propositions AP and a formula φ, and let B_3 = {⊤, ⊥, ?} be the truth domain, where ? denotes an inconclusive verdict. To check the property on readers, we specify it as the local property ⟨reader, {read}, B_3, LTL3^{read}_{φ_1r}⟩. Similarly, we can define the local specification for at least one write.
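On a finite trace, the LTL3 verdict of F(read) is ⊤ as soon as a read is observed and inconclusive (?) otherwise, since an extension of the trace may still contain a read. A one-line sketch (the string verdict encoding is ours):

```python
def eval_F_read(trace):
    """LTL3-style verdict for F(read) on a finite event trace:
    T once a read occurs, ? otherwise (a read may still happen later)."""
    return "T" if "read" in trace else "?"

print(eval_F_read(["read"]))  # T
print(eval_F_read([]))        # ?
```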

Scope Trace To evaluate a local property, we restrict the trace to the actions of a given thread contained within a scope region. A scope trace is analogous to the trace acquired for *per-thread* monitoring [5, 30], restricted to a given scope region (see Definition 5). The scope trace is defined as a projection of the concurrent execution onto a specific thread, selecting the actions that fall between two synchronizing actions.

Definition 8 (Scope trace). *Given a local property* p = ⟨type, EVS, RT, eval⟩ *in a scope region* R_s *with index* i*, a* scope trace *is obtained using the projection function* proj*, which outputs the sequence of actions of length* n *for a given thread with* tid ∈ TID *that are associated with events for the property. We have:* ∀ℓ ∈ [0, n]

$$\begin{aligned} \text{proj}(\text{tid}, i, \mathbf{p}, \mathcal{R}\_{\mathbf{s}}) & \stackrel{\text{def}}{=} \begin{cases} \text{filter}(a\_0) \cdot \dots \cdot \text{filter}(a\_n) & \text{if } i \in \text{dom}(\mathcal{R}\_{\mathbf{s}}) \wedge \text{tid} \in \text{pool}(\text{type}), \\ \mathcal{E} & \text{otherwise}, \end{cases} \\ \text{with: } \text{filter}(a\_\ell) & \stackrel{\text{def}}{=} \begin{cases} e & \text{if } \text{ev}\_{\text{type}}(a\_\ell) \in \text{EVS} \\ \mathcal{E} & \text{otherwise}, \end{cases} \end{aligned}$$

*where* · *is the sequence concatenation operator (such that* a · 𝓔 = 𝓔 · a = a*), with* (∀j ∈ [1, n] : ⟨a_{j−1}, a_j⟩ ∈ →) ∧ (∀k ∈ [0, n] : a_k ∈ R_s(i) ∧ a_k.threadid = tid)*.*

For a given thread, the scope trace filters the actions of a scope region that are associated with an event of the local property (i.e., ev_type(a_ℓ) ∈ EVS). It includes only actions of the *threadid* whose type matches the local specification (i.e., tid ∈ pool(type)). While the scope trace is obtained by projection, the actions still need to be converted to events before evaluating local properties. To do so, we generate the sequence of events associated with the actions in the projected trace: for a given action a_ℓ in the sequence, we output ev_type(a_ℓ), and we denote the generated sequence by evs(proj(tid, i, p, R_s)).
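Putting proj and evs together for a single region can be sketched as follows; the dict encodings of actions and local properties are our own simplification of Definition 8.

```python
EMPTY = None  # stands for the empty event

def proj(tid, region, prop, pools):
    """Scope trace (Definition 8, simplified): keep the actions of thread
    `tid` in the region that map to an event of the property."""
    if tid not in pools[prop["type"]]:
        return []
    ev = prop["ev"]
    return [a for a in region if a["tid"] == tid and ev(a) in prop["EVS"]]

def evs(trace, prop):
    """Convert the projected actions to their events for evaluation."""
    return [prop["ev"](a) for a in trace]

# Hypothetical region of readers-writers: two concurrent reads, one unlock.
prop_r = {"type": "reader", "EVS": {"read"},
          "ev": lambda a: "read" if a["lbl"] == "r" else EMPTY}
pools = {"reader": {1, 2}, "writer": {0}}
region = [{"tid": 1, "lbl": "r"}, {"tid": 2, "lbl": "r"}, {"tid": 1, "lbl": "u"}]
print(evs(proj(1, region, prop_r, pools), prop_r))  # ['read']
```

Thread 1's unlock is filtered out because it maps to the empty event, and the writer (tid 0) yields an empty trace because its type does not match the property.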

*Example 5 (Scope trace).* Figure 3 illustrates the projection on the scope regions defined using the resource lock (Ex. 3) for each of the one writer and two reader threads, where the properties "at least one write" and "at least one read" (Example 4) apply. The scope traces for region k + 1 are 0.w^0, 𝓔, 𝓔 for the threads with thread ids 0, 1, and 2, respectively. For that region, we can now evaluate the local specification independently for each thread on the resulting traces by converting them to the sequences of events write, 𝓔, 𝓔.

Proposition 2 (proj preserves per-thread order). *Given a scope* s*, a thread with threadid* tid*, and a local property* p*, we have:* ∀i ∈ dom(R_s) : snd(⟨A, →⟩, proj(tid, i, p, R_s)) ∧ faith(⟨A, →⟩, proj(tid, i, p, R_s))*.*

Proposition 2 holds by construction (from Definition 8): the projection function proj does not produce any new actions and does not change any order information from the point of view of a given thread. We also note the assumption that all actions of a single thread are totally ordered; therefore we capture all possible order information for the actions in the scope region. Finally, the function filter only suppresses actions that are not relevant to the property, without adding or re-ordering actions. The sequence of events obtained using the function evs follows the same order.

Scope State A scope state aggregates the results of evaluating all local properties for a given scope region. To define a scope state, we consider a scope s with a list of local properties ⟨prop_0, ..., prop_n⟩ of return types ⟨RT_0, ..., RT_n⟩, respectively. Since a local specification can apply to an arbitrary number of threads during the execution, for each specification we create the type as a dictionary binding a *threadid* to the return type (represented as a total function). We use the special type na to indicate that the property does not apply to the thread (as the thread type does not match the property). We can now define the return type of evaluating all local properties as RI ≝ ⟨TID → {na} ∪ RT_0, ..., TID → {na} ∪ RT_n⟩. Function state_s : RI → I_s processes the results of evaluating the local properties to create a scope state in I_s.

*Example 6 (Scope state).* We illustrate the scope state by evaluating the properties "at least one read" (p_r) and "at least one write" (p_w) (Ex. 4) on scope region k + 2 in Fig. 3. We have TID = {0, 1, 2}; the trace of each reader is (read), and the trace of the writer is empty (i.e., no write was observed). As such, for property p_r (resp. p_w), the result of the evaluation is [0 ↦ na, 1 ↦ ⊤, 2 ↦ ⊤] (resp. [0 ↦ ?, 1 ↦ na, 2 ↦ na]). Notice that for property p_r, the thread of type writer evaluates to na, as it is not concerned with the property.

We now consider the state creation function state_s. We consider the atomic propositions activereader, activewriter, allreaders, and onewriter, which indicate respectively: at least one thread of type reader performed a read, at least one thread of type writer performed a write, all threads of type reader (|pool(reader)| of them) performed at least one read, and at most one thread of type writer performed a write. The scope state in this case is a list of four Boolean values, one per atomic proposition. As such, by counting the number of threads associated with ⊤, we can compute the Boolean value of each atomic proposition. For region k + 2, we have the state ⟨⊤, ⊥, ⊤, ⊥⟩. We can establish a total order on scope states. For k + 1, k + 2, and k + 3, we have the sequence ⟨⊥, ⊤, ⊥, ⊤⟩ · ⟨⊤, ⊥, ⊤, ⊥⟩ · ⟨⊥, ⊤, ⊥, ⊤⟩.
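This counting can be sketched in Python. The dict encodings are our assumptions, and we compute onewriter as "exactly one observed write", which is the reading that reproduces the states listed in the example.

```python
TOP, NA = "T", "na"

def state_rw(res_reader, res_writer, pools):
    """Build the 4-proposition scope state of Example 6 by counting
    per-thread verdicts (threadid -> verdict dicts, na = not applicable)."""
    reads = sum(v == TOP for v in res_reader.values() if v != NA)
    writes = sum(v == TOP for v in res_writer.values() if v != NA)
    activereader = reads >= 1
    activewriter = writes >= 1
    allreaders = reads == len(pools["reader"])
    onewriter = writes == 1   # exactly one write, matching the example states
    return (activereader, activewriter, allreaders, onewriter)

pools = {"reader": {1, 2}, "writer": {0}}
# Region k+2: both readers read; the writer observed nothing (verdict '?').
res_r = {0: NA, 1: TOP, 2: TOP}
res_w = {0: "?", 1: NA, 2: NA}
print(state_rw(res_r, res_w, pools))  # (True, False, True, False)
```

The output corresponds to the state ⟨⊤, ⊥, ⊤, ⊥⟩ given for region k + 2.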

We are now able to formally define a scope by associating an identifier with a synchronizing predicate, a list of local properties, a state creation function, and a scope property evaluation function. We denote by SID the set of scope identifiers.

Definition 9 (Scope). *A scope is a tuple* ⟨sid, sync_sid, ⟨prop_1, ..., prop_n⟩, state_sid, seval_sid⟩*, where:*

- sid ∈ SID is the scope identifier;
- sync_sid : A → B_2 is the synchronizing predicate;
- ⟨prop_1, ..., prop_n⟩ is the list of local properties (Definition 7);
- state_sid : RI → I_sid is the state creation function;
- seval_sid is the scope property evaluation function, defined over sequences of scope states in I_sid.

### 3.3 Semantics for Evaluating Scopes

After defining scope states, we are now able to evaluate properties on the scope. To evaluate a scope property, we first evaluate each local property for the scope region; we then use state_sid to generate the scope state for the region. After producing the sequence of scope states, the function seval_sid evaluates the property at the level of the scope.

Definition 10 (Evaluating a scope property). *Using the synchronizing predicate* sync_sid*, we obtain the regions* R_sid(i) *for* i ∈ [0, m] *with* m = |idx_sid| + 1*. The evaluation of a scope property (noted* res*) for the scope* ⟨sid, sync_sid, ⟨prop_0, ..., prop_n⟩, state_sid, seval_sid⟩ *is computed as:* ∀tid ∈ TID, ∀j ∈ [0, n]

$$
\text{res} = \text{seval}\_{\text{sid}}(SR\_0 \cdot \ldots \cdot SR\_m), \quad \text{where } SR\_i = \text{state}\_{\text{sid}}(\langle LR^i\_0, \ldots, LR^i\_n \rangle)
$$

$$
LR^i\_j = \begin{cases}
\text{tid} \mapsto \text{prop}\_j.\text{eval}(\text{evs}(\text{proj}(\text{tid}, i, \text{prop}\_j, \mathcal{R}\_{\text{sid}}))) & \text{if } \text{tid} \in \text{pool}(\text{prop}\_j.\text{type}), \\
\text{tid} \mapsto \text{na} & \text{otherwise}
\end{cases}
$$
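Definition 10 composes projection, local evaluation, state creation, and scope evaluation. A compact Python sketch of this pipeline follows; all encodings (regions as action lists, properties and pools as dicts) are our own simplifications.

```python
NA = "na"  # verdict for threads the property does not apply to

def evaluate_scope(regions, props, pools, state_fn, seval_fn):
    """Definition 10, simplified: evaluate every local property per thread
    in every region, fold the results into scope states, then hand the
    state sequence to the scope-level evaluation function."""
    tids = set().union(*pools.values())
    scope_states = []
    for region in regions:
        per_prop = []
        for p in props:
            res = {}
            for tid in tids:
                if tid in pools[p["type"]]:
                    events = [p["ev"](a) for a in region
                              if a["tid"] == tid and p["ev"](a) in p["EVS"]]
                    res[tid] = p["eval"](events)
                else:
                    res[tid] = NA
            per_prop.append(res)
        scope_states.append(state_fn(per_prop))
    return seval_fn(scope_states)

# Toy instantiation: local property "at least one read" per reader;
# scope verdict: every region had at least one active reader.
prop_r = {"type": "reader", "EVS": {"read"},
          "ev": lambda a: "read" if a["lbl"] == "r" else None,
          "eval": lambda evts: "T" if evts else "?"}
pools = {"reader": {1, 2}}
regions = [[{"tid": 1, "lbl": "r"}], [{"tid": 2, "lbl": "r"}]]
state = lambda rs: any(v == "T" for v in rs[0].values())
print(evaluate_scope(regions, [prop_r], pools, state, all))  # True
```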

*Example 7 (Evaluating scope properties).* We use LTL to formalize three scope properties based on the scope states from Ex. 6, operating on the alphabet {activereader, activewriter, allreaders, onewriter}:


Therefore the specification is G(φ_0 ∧ φ_1 ∧ φ_2). We recall that a scope state is a list of Boolean values for the atomic propositions in the following order: activereader, activewriter, allreaders, and onewriter. The sequence of scope states from Ex. 6, ⟨⊥, ⊤, ⊥, ⊤⟩ · ⟨⊤, ⊥, ⊤, ⊥⟩ · ⟨⊥, ⊤, ⊥, ⊤⟩, complies with the specification.

*Correctness of Scope Evaluation* We assume that the SAs selected by the user in the specification are totally ordered. This ensures that the order of the scope states is a total order; it is then, by assumption, sound and faithful to the order of the SAs. However, it is important to ensure that the actions needed to construct the state are captured soundly and faithfully. We capture the partial order as follows: (1) actions of different threads are captured in a sound and faithful manner between two successive SAs (Proposition 1), and (2) actions of the same thread are captured in a sound and faithful manner for that thread (Proposition 2). Furthermore, Definition 10 guarantees that each local property evaluation function is passed all actions relevant to the given thread (and no others). As such, at the granularity level of the SAs, we obtain all relevant order information.

*Evaluating without resetting.* Notice that in Definition 10 the monitors for local properties are reset for each concurrency region. As such, they cannot express properties that span multiple concurrency regions of the same thread. The semantics of function res conceptually treats concurrency regions independently. However, we can extend the expressiveness of local properties by extending the alphabet of each local property with the atomic proposition sync, which delimits the concurrency region. The proposition sync denotes that the scope synchronizing action has occurred and is added to the trace. We must take care that threads may sleep and not receive any events during a concurrent region. For example, consider two threads waiting on a lock: when one thread gets the lock, the other will not. As such, passing the sync event to the local specification of the sleeping thread would require us to instrument very

Fig. 4: Example of a scope channel for 1-Writer 2-Readers.

intrusively to account for that, a requirement we do not want to impose. Therefore, we add the restriction that local properties are only evaluated if at least one event relevant to the local property (other than the synchronization event) is encountered in the concurrency region. With this restriction, we can define an evaluation that considers all events from concurrent region 0 up to i, adding sync events between scopes (we omit the definition for brevity). This allows local monitors to account for synchronization, either to reset or to check more expressive specifications such as "*a reader can read at most* n *times every* m *concurrency regions*" and "*writers must always write a value that is greater than the last write*".

### 3.4 Communicating Verdicts and Monitoring

We now proceed to describe how the monitors communicate their verdicts.

*Scope channel.* The *scope channel* stores information about the scope states during the execution. We associate each scope with a scope channel that has its own timestamp. The channel provides each thread-local monitor with an exclusive memory slot to write its result when evaluating local properties. Each thread can only write to its associated slot in the channel. The timestamp of the channel is readable by all threads participating in the scope but is only incremented by the scope monitor, as we will see.

*Example 8 (Scope channel).* Figure 4 displays the channel associated with the scope monitoring discussed in Ex. 6. For each scope region, the channel provides each monitor with an exclusive memory slot to write its result (if the thread is not sleeping). The slots marked with a dash (-) indicate the absence of a monitor. Furthermore, na indicates that the thread was given a slot but did not write anything in it (see Definition 10).

For a timestamp t, local monitors no longer write any information for scope states with a timestamp smaller than t; such states are therefore always consistent and can be read by any monitor associated with the scope. While this is beyond the scope of this paper, it effectively allows monitors to consistently access past data of other monitors.
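A minimal single-process sketch of such a channel is shown below; the class and method names, and the slot encoding, are our own assumptions rather than the paper's implementation.

```python
class ScopeChannel:
    """Sketch of a scope channel: one exclusive slot per thread and
    a timestamp that only the scope monitor advances."""

    def __init__(self, tids):
        self.tids = list(tids)
        self.timestamp = 0                            # advanced only by the scope monitor
        self.history = [{tid: "na" for tid in tids}]  # one row of slots per timestamp

    def write(self, tid, verdict):
        # A thread-local monitor writes only to its own slot at the
        # current timestamp, so writers never contend with each other.
        self.history[self.timestamp][tid] = verdict

    def advance(self):
        # Called by the scope monitor after building the scope state:
        # rows below the new timestamp become immutable and safe to read.
        frozen = dict(self.history[self.timestamp])
        self.timestamp += 1
        self.history.append({tid: "na" for tid in self.tids})
        return frozen

ch = ScopeChannel([0, 1, 2])
ch.write(1, "T")
ch.write(2, "T")
print(ch.advance())   # {0: 'na', 1: 'T', 2: 'T'}
print(ch.timestamp)   # 1
```

Since each thread owns its slot and past rows are never rewritten, no lock is needed between the thread-local monitors.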

*Thread-local monitors.* Each thread-local monitor is responsible for monitoring a local property for a given thread. Recall that each thread is associated with an identifier and a type. Multiple such monitors can exist on a given thread, depending on the properties to be checked. These monitors are spawned on the creation of the thread. A thread-local monitor receives an event, performs checking, and can write its result to its associated scope channel at the current timestamp.

*Scope monitors.* Scope monitors are responsible for checking the property at the level of the scope. Upon reaching a synchronizing action by any of the threads associated with the scope, the given thread will invoke the scope monitor. The scope monitor relies on the scope channel (shared among all threads) to have access to all observations. Additional memory can be allocated for its own state, but it has to be shared among all threads associated with the scope. The scope monitor is invoked atomically after reaching the scope synchronizing action. First, it constructs the scope state based on the results of the thread-local monitors stored in the scope channel. Second, it invokes the verification procedure on the generated state. Finally, before completing, it increments the timestamp associated with the scope channel.

# 4 Preliminary Assessment of Overhead

We first opportunistically monitor *readers-writers*, using the specification found in Ex. 7. We then demonstrate our approach with classical concurrent programs<sup>2</sup> .

### 4.1 Readers-Writers

*Experiment setup.* For this experiment, we use the standard LTL3 semantics defined over the B_3 verdict domain. As such, all the local and scope property types are B_3. We instrument *readers-writers* to insert our monitors and compare our approach to global monitoring using a custom aspect written in AspectJ. In total, we have three scenarios: non-monitored, global, and opportunistic. In the first scenario (non-monitored), we do not perform monitoring. In the second and third scenarios, we perform global and opportunistic monitoring, respectively. We recall that global monitoring introduces additional locks at the level of the monitor for all events that occur concurrently. We make sure that the program is well synchronized and data-race free with RVPredict [37].

*Measures.* To evaluate the overhead of our approach, we define parameters that characterize the concurrency regions found in *readers-writers*. We identify two parameters: the *number of readers* (nreaders) and the *width of the concurrency region* (cwidth). On the one hand, nreaders determines the maximum number of parallel threads verifying local properties in a given concurrency region. On the other hand, cwidth determines the number of reads each reader performs concurrently when acquiring the lock. Parameter cwidth is measured in the number of read events generated. By increasing the size of the concurrency regions, we increase lock contention when multiple concurrent events cause a global monitor to lock. We use a number of writers equal to nreaders ∈ {1, 3, 7, 15, 23, 31, 63, 127} and cwidth ∈ {1, 5, 10, 15, 30, 60, 100, 150}.

<sup>2</sup> The artifact for this paper is available [56].

Fig. 5: Execution time for *readers-writers* for non-monitored, global, and opportunistic monitoring when varying the number of readers.

We perform a total of 100,000 writes and 400,000 reads, where reads are distributed evenly across readers. We measure the execution time (in ms) of 50 runs of the program for each of the parameters and scenarios.

*Preliminary results.* We report the results using averages, and provide the scatter plots with linear regression curves in Figures 5 and 6. Figure 5 shows the overhead when varying the number of readers (nreaders). We notice that for the base program (non-monitored), the execution time increases as lock contention overhead becomes more prominent and the JVM manages more threads. In the case of global monitoring, as expected, we notice an increasing overhead with the increase in the number of threads. As more readers are executing, the program is blocked on each read that is supposed to be concurrent. For opportunistic monitoring, we notice a stable runtime in comparison to the original program, as no additional locks are being used; only the delay to evaluate the local and scope properties remains. Figure 6 shows the overhead when varying the width of the concurrency region (cwidth). We observe that for the base program, the execution time decreases as more reads can be performed concurrently without contention on the shared resource lock. In the case of global monitoring, we also notice a slight decrease, while for opportunistic monitoring, we see a much greater decrease. By increasing the number of concurrent events in a concurrency region, we

Fig. 6: Execution time varying the number of events in the concurrency region.

highlight the overhead introduced by locking the global monitor. We recall that a global monitor must lock to linearize the trace and, as such, interferes with concurrency. This can be seen by comparing the curves for global and opportunistic monitoring: opportunistic closely follows the speedup of the non-monitored program, while global monitoring is much slower. For opportunistic monitoring, we expect a positive performance payoff when events in concurrency regions are dense.

### 4.2 Other Benchmarks

We target classical benchmarks that use different concurrency primitives to synchronize threads. We perform global and opportunistic monitoring and report our results using the averages of 100 runs in Figure 7. We use an implementation of the Bakery lock algorithm [39], for two threads (*2-bakery*) and n threads (*n-bakery*). The algorithm performs synchronization using reads and writes on shared variables and guarantees mutual exclusion on the critical section. As such, we monitor the program for the *bounded waiting* property, which specifies that a process should not wait for more than a limited number of turns before entering the critical section. For opportunistic monitoring, thread-local monitors are deployed on each thread to monitor whether the thread acquires the critical section. Scope monitors check whether a thread waits for more than n turns before entering the critical section. We notice slightly less overhead with opportunistic than global monitoring for

Fig. 7: Execution time of benchmarks.

*2-bakery* and more overhead with opportunistic on *n-bakery*. This is because of the small concurrency region (cwidth), which is equal to 1. As such, the overhead of evaluating local and scope monitors across several threads, with a cwidth of 1, exceeds the performance gained by our approach, making this benchmark a poor fit for opportunistic monitoring.

We also monitor a textbook example of the Ping-Pong algorithm [33], used for instance in databases and routing protocols. The algorithm synchronizes two threads using reads and writes on shared variables and busy waiting, producing events pi for the pinging thread and po for the ponging thread. We monitor the *alternation* property, specified as φ ≝ (ping ⇒ X pong) ∧ (pong ⇒ X ping). We also include a classic producer-consumer program from [35], which uses a concurrent FIFO queue based on locks and conditions. We monitor the *precedence* property, which requires that a consume (event c) be preceded by a produce (event p), expressed in LTL as ¬c W p. For both of the above benchmarks, we observe less overhead when monitoring opportunistically, since no additional locks are enforced on the execution.
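For illustration, the precedence property ¬c W p can be decided by a tiny three-valued monitor; the sketch below and its string verdict encoding are our own, not the benchmark's implementation.

```python
def monitor_precedence(trace):
    """Three-valued monitor for the precedence property ¬c W p
    ("no consume before the first produce"); verdicts as in B3."""
    for event in trace:
        if event == "p":
            return "T"   # a produce arrived first: satisfied forever
        if event == "c":
            return "F"   # a consume before any produce: violated
    return "?"           # inconclusive on what was seen so far

print(monitor_precedence(["p", "c", "c"]))  # T
print(monitor_precedence(["c", "p"]))       # F
print(monitor_precedence([]))               # ?
```

The verdict is determined by whichever of p or c occurs first, so the monitor can stop after the first relevant event.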

We also monitor a parallel mergesort algorithm, a divide-and-conquer algorithm to sort an array. The algorithm uses the fork-join framework [41], which recursively splits the array into sorting tasks that are handled by different threads. We are interested in monitoring whether a forked task returns a correctly sorted array before performing a merge. The monitoring step is expensive and linear in the size of the array, as it involves scanning it. For opportunistic monitoring, we use the joining of two subtasks as our synchronizing action and deploy scope monitors at all levels of the recursive hierarchy. We observe less overhead with opportunistic than with global monitoring, as concurrent threads do not have to wait at each monitoring step. This benchmark motivates us to further investigate other hierarchical models of computation where opportunistic RV can be used, such as [22].

# 5 Related Work

We focus here on techniques developed for the verification of behavioral properties of multithreaded programs written in Java and refer to [12] for a detailed survey on tools covering generic concurrency errors. The techniques we cover typically analyze a trace to either *detect* or *predict* violations.

Java-MOP [18], Tracematches [5, 13], MarQ [51], and LARVA [21], chosen from the RV competitions [8, 26, 52], are runtime monitoring tools for violation detection. These tools support different specification formalisms such as finite-state machines, extended regular expressions, context-free grammars, past-time linear temporal logic, and Quantified Event Automata (QEA) [6]. Their specifications rely on a total order of events and require that a collected trace be linearized. They were initially developed to monitor single-threaded programs and later adapted to monitor multithreaded programs. As mentioned, to monitor global properties spanning multiple threads, these techniques impose a lock on each event, blocking concurrent regions in the program and forcing threads to synchronize. Moreover, they often produce inconsistent verdicts in the presence of concurrent events [23]. EnforceMOP [44], for instance, can be used to detect and enforce properties (deadlocks as well). It controls the runtime scheduler and blocks the execution of threads that might cause a property violation, sometimes itself leading to a deadlock.

Predictive techniques [19,31,38,54] reason about all feasible interleavings from a recorded trace of a single execution. As such, they need to establish the causal ordering between the actions of the program. These tools implement vector clock algorithms, such as [53], to timestamp events. The algorithm blocks the execution on each property event and also on all synchronizing actions such as reads and writes. Vector clock algorithms typically require synchronization between the instrumentation, program actions, and the algorithm's processing to avoid data races [16]. jPredictor [19], for instance, uses sliced causality [17] to prune the partial order such that only relevant synchronization actions are kept. This is achieved with the help of static analysis and after recording at least one execution of the program. The tool is demonstrated on atomicity violations and data races; however, we are not aware of an application in the context of generic behavioral properties. RVPredict [37] develops a sound and maximal causal model to analyze concurrency in a multithreaded program. The correct behavior of a program is modeled as a set of logical constraints, thus restricting the possible traces to consider. Traces are ordered permutations containing both control-flow operations and memory accesses and are constrained by axioms tailored to data races and sequential consistency. The theory supports arbitrary logical constraints to determine correctness, so it would in principle be possible to encode a specification on multithreaded programs as such; however, while the model supports encoding arbitrary specifications, the provided tool (RVPredict) does not. In [27], the authors present ExceptioNULL, which targets null-pointer exceptions. Violations and causality are represented as constraints over actions, and the feasibility of violations is explored via an SMT constraint solver.
GPredict [36] extends the specification formalism past data races to target generic concurrency properties. GPredict presents a generic approach to reason about behavioral properties and hence constitutes a monitoring solution when concurrency is present. Notably, GPredict requires specifying thread identifiers explicitly in the specification. This makes specifications with multiple threads extremely verbose and unable to handle a dynamic number of threads. For example, in the case of *readers-writers*, adding extra readers or writers requires rewriting the specification and combining events to specify each new thread. The approach behind GPredict could also be extended to become more expressive, e.g., to support counting events to account for fairness in a concurrent setting. Furthermore, GPredict relies on recording a trace of a program before performing an offline analysis to determine concurrency errors [36]. In addition to being incomplete due to the possibility of not getting results from the constraint solver, the analysis of GPredict might also miss some order relations between events, resulting in false positives. In general, the presented predictive tools are often designed to be used offline and, unfortunately, many of them are no longer maintained.

In [14,15], the authors present monitoring for *hyperproperties* written in alternation-free fragments of HyperLTL [20]. Hyperproperties are specified over sets of execution traces instead of a single trace. In our setup, each thread produces its own trace, and thus the scope properties we monitor can be expressed in HyperLTL, for instance. The occurrence times of events are delimited by concurrency regions, and thus traces consist of propositions that summarize a concurrency region. We have yet to explore the applicability of specifying and monitoring hyperproperties within our opportunistic approach.

# 6 Conclusion and Perspectives

We introduced a generic approach for the online monitoring of multithreaded programs. Our approach distinguishes between thread-local properties and properties that span concurrency regions, referred to as scopes (both types of properties can be monitored with existing tools). Our approach relies heavily on operations in the program that are already totally ordered; however, by utilizing this existing synchronization, we can monitor online while leveraging both existing per-thread and global monitoring techniques. Finally, our preliminary evaluation suggests that opportunistic monitoring generally incurs a lower overhead than classical monitoring.

While the preliminary results are promising, additional work is needed to complete the automatic synthesis and instrumentation of monitors. So far, splitting a property over local and scope monitors is done manually, and the user must guarantee that scope regions follow a total order. Analyzing the program to find and suggest scopes suitable for splitting and monitoring a given property is an interesting challenge that we leave for future work. The program can be run, for instance, to capture its causality and recommend suitable synchronization actions for delimiting scope regions. Furthermore, the expressiveness of the specification can be increased by extending scopes to contain other scopes and adding more levels of monitors. This allows for properties that target not just thread-local behavior but also concurrent regions enclosed in other concurrent regions, thus creating a hierarchical setting.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Parallel Program Analysis via Range Splitting

Jan Haltermann<sup>1</sup>(✉)<sup>?</sup>, Marie-Christine Jakobs<sup>2</sup>, Cedric Richter<sup>1</sup>, and Heike Wehrheim<sup>1</sup>

<sup>1</sup> University of Oldenburg, Department of Computing Science, Oldenburg, Germany {jan.haltermann,cedric.richter,heike.wehrheim}@uol.de

<sup>2</sup> Technical University of Darmstadt, Computer Science, Darmstadt, Germany jakobs@cs.tu-darmstadt.de

Abstract. Ranged symbolic execution has been proposed as a way of scaling symbolic execution by splitting the task of path exploration onto several workers running in parallel. The split is conducted along path ranges which – simply speaking – describe sets of paths. Workers can then explore path ranges in parallel.

In this paper, we propose ranged analysis as a generalization of ranged symbolic execution to arbitrary program analyses. This allows us not only to parallelize a single analysis, but also to run different analyses on different ranges of a program in parallel. Besides this generalization, we also provide a novel range splitting strategy operating along loop bounds, complementing the existing random strategy of the original proposal. We implemented ranged analysis within the tool CPAchecker and evaluated it on programs from the SV-COMP benchmark. The evaluation in particular shows the superiority of loop-bounds splitting over random splitting. We furthermore find that compositions of ranged analyses can solve analysis tasks that none of the constituent analyses can solve alone.

Keywords: Ranged Symbolic Execution, Cooperative Software Verification, Parallel Configurable Program Analysis

# 1 Introduction

Recent years have seen enormous progress in automatic software verification, driven amongst others by annual competitions like SV-COMP [13]. Software verification tools employ a range of different techniques for analysis, like predicate analysis, bounded model checking, k-induction, property-directed reachability, or automata-based methods. As, however, none of these techniques is superior to the others, today often a form of cooperative verification [24] is employed. The idea of cooperative verification is to have different sorts of analyses cooperate on the task of software verification. This principle has already been implemented in various forms [16,19,33,59], in particular also as cooperations of testing and verification tools [10,39,41,42]. Such cooperations most often take the form of sequential combinations, where one tool starts with the full task, stores its partial analysis result within some verification artefact, and the next tool then works on the remaining task.

<sup>?</sup> This author was partially supported by the German Research Foundation (DFG) – WE2290/13-1 (Coop).

In contrast, parallel execution of different tools is in the majority of cases only done by portfolio approaches, simply running the different tools on the same task in parallel. One reason for using portfolios when employing parallel execution is the fact that it is unclear how to best split a program into parts on which different tools could work separately while still being able to join their partial results into one result for the entire program.

With ranged symbolic execution, Siddiqui and Khurshid [86] proposed one such technique for splitting programs into parts. The idea of ranged symbolic execution is to scale symbolic execution by splitting path exploration onto several workers, thereby in particular allowing the workers to operate in parallel. To this end, they defined so-called path ranges. A path range describes a set of program paths defined by two inputs to the program, where the path $\pi_1$ triggered by the first input is the lower bound and the path $\pi_2$ for the second input is the upper bound of the range. All paths in between, i.e., paths $\pi$ such that $\pi_1 \le \pi \le \pi_2$ (based on some ordering $\le$ on paths), make up a range. A worker operating on a range performs symbolic execution on paths of the range only. In their experiments, Siddiqui and Khurshid investigated one form of splitting via path ranges, namely by randomly generating inputs, which then make up a number of ranges.

In this paper, we generalize ranged symbolic execution to arbitrary analyses. In particular, we introduce the concept of a ranged analysis to execute an arbitrary analysis on a given range, and compose different ranged analyses, which can then operate on different ranges in parallel. Also, we propose a novel splitting strategy, which generates ranges along loop bounds. We implemented ranged analysis in the software verification tool CPAchecker [21], which already provides a number of analyses, all defined as configurable program analyses (CPAs). To integrate ranged analysis in CPAchecker, we defined a new range reduction CPA and then employed the built-in feature of analysis composition to combine it with different analyses. The thus obtained ranged analyses are then run on different ranges in parallel, using CoVeriTeam [20] as the tool for orchestration. We furthermore implemented two strategies for generating path ranges: our novel strategy employing loop bounds for defining ranges plus the original random splitting technique. A loop bound $n$ splits program paths into ranges that enter the loop at most $n$ times and ranges that enter it more than $n$ times<sup>3</sup>.

Our evaluation on SV-COMP benchmarks [36] first of all confirms the results of Siddiqui and Khurshid [86] in that symbolic execution benefits from ranged execution. Second, our results show that a loop-bound-based splitting strategy brings an improvement over random splitting. Finally, we see that a composition of ranged analyses can solve analysis tasks that none of the (different) constituent analyses of a combination can solve alone.

<sup>3</sup> Such splits can also be performed on intervals of loop bounds, thereby generating more than two path ranges.

Fig. 1: Example program mid (taken from [86]) and its CFA

# 2 Background

We start by introducing some notations on programs, defining path ranges, and introducing configurable program analysis as implemented in CPAchecker.

### 2.1 Program Syntax and Semantics

For the sake of presentation, we consider simple, imperative programs with deterministic control flow and one sort of variables (from some set $V$) only<sup>4</sup>. Formally, we model a program by a control-flow automaton (CFA) $P = (L, \ell_0, G)$, where $L \subseteq Loc$ is a subset of the program locations $Loc$ (the program counter values), $\ell_0 \in L$ represents the beginning of the program, and control-flow edges $G \subseteq L \times Ops \times L$ describe when which statements may be executed. Therein the set of statements $Ops$ contains all possible statements, e.g., assume statements (boolean expressions over variables $V$, denoted by $BExpr$), assignments, etc. We expect that CFAs originate from program code and, thus, control flow may only branch at assume operations, i.e., CFAs $P = (L, \ell_0, G)$ are deterministic in the following sense: for all $(\ell, op', \ell'), (\ell, op'', \ell'') \in G$, either $op' = op'' \wedge \ell' = \ell''$, or $op'$, $op''$ are assume operations and $op' \equiv \neg(op'')$. We assume that there exists an indicator function $B_P : G \to \{T, F, N\}$ that reports the branch direction, either N(one), T(rue), or F(alse). This indicator function assigns $N$ to all edges without assume operations, and for any two assume operations $(\ell, op', \ell'), (\ell, op'', \ell'') \in G$ with $op' \neq op''$ it guarantees $B_P((\ell, op', \ell')) \cup B_P((\ell, op'', \ell'')) = \{T, F\}$. Since CFAs are typically derived from programs, and assume operations correspond to the two evaluations of conditions of, e.g., if or while statements, the assume operation representing the true evaluation of the condition is typically assigned $T$. We will later need this indicator function for defining path orderings.

Figure 1 shows our example program mid, which returns the middle value of the three input values, and its CFA. For each condition of an if statement it contains one assume edge for each evaluation of the condition, namely solid edges labelled by the condition for entering the if branch after the condition evaluates

<sup>4</sup> Our implementation supports C programs.

to true and dashed edges labelled by the negated condition for entering the else branch after the condition evaluates to false, i.e., the negated condition evaluates to true. All other statements are represented by a single edge.
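To make these notions concrete, the following Python sketch encodes a CFA as an edge list together with the branch indicator $B_P$. The edge set is a hypothetical reconstruction of the CFA of `mid` from Fig. 1; the exact locations and return edges are our guesses, consistent with the two example paths given in Sec. 2.2:

```python
# Edges are (source location, operation, target location); operations
# starting with "!" stand for the negated-condition (false) assume edge.
EDGES = [
    (0, "x < y", 1), (0, "!(x < y)", 2),
    (1, "y < z", 3), (1, "!(y < z)", 4),
    (2, "x < z", 5), (2, "!(x < z)", 6),
    (4, "x < z", 7), (4, "!(x < z)", 8),
    (3, "ret y", 11), (5, "ret x", 11), (6, "ret y", 11),
    (7, "ret z", 11), (8, "ret x", 11),
]

def branch_indicator(edge):
    """B_P: 'N' for non-assume edges, 'T'/'F' for the two assume edges."""
    src, op, _ = edge
    # An edge is an assume edge iff its source location has two outgoing edges.
    siblings = [e for e in EDGES if e[0] == src]
    if len(siblings) == 1:
        return "N"
    return "F" if op.startswith("!") else "T"
```

For every branching location, the two outgoing assume edges receive the indicators `{"T", "F"}`, matching the requirement on $B_P$.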

We continue with the operational semantics of programs. A program state is a pair $(\ell, c)$ of a program location $\ell \in L$ and a data state $c$ from the set $C$ of data states, which assign to each variable $v \in V$ a value of the variable's domain. Program execution paths $\pi = (\ell_0, c_0) \xrightarrow{g_1} (\ell_1, c_1) \xrightarrow{g_2} \cdots \xrightarrow{g_n} (\ell_n, c_n)$ are sequences of states and edges such that (1) they start at the beginning of the program and (2) only perform valid execution steps that (a) adhere to the control flow, i.e., $\forall\, 1 \le i \le n : g_i = (\ell_{i-1}, \cdot, \ell_i)$, and (b) properly describe the effect of the operations, i.e., $\forall\, 1 \le i \le n : c_i = sp_{op_i}(c_{i-1})$, where the strongest postcondition $sp_{op_i} : C \rightharpoonup C$ is a partial function modeling the effect of operation $op_i \in Ops$ on data states. Execution paths are also called feasible paths, and paths that fulfil properties (1) and (2a) but violate property (2b) are called infeasible paths. The set of all execution paths of a program $P$ is denoted by $paths(P)$.

### 2.2 Path Ordering, Execution Trees, and Ranges

Our ranged analysis analyses sets of consecutive program execution paths. To specify these sets, we first define an ordering on execution paths. Given two program paths $\pi = (\ell_0, c_0) \xrightarrow{g_1} (\ell_1, c_1) \xrightarrow{g_2} \cdots \xrightarrow{g_n} (\ell_n, c_n)$ and $\pi' = (\ell'_0, c'_0) \xrightarrow{g'_1} (\ell'_1, c'_1) \xrightarrow{g'_2} \cdots \xrightarrow{g'_m} (\ell'_m, c'_m) \in paths(P)$, we define their order $\le$ based on their control-flow edges. More specifically, edges with assume operations representing a true evaluation of a condition are smaller than the edges representing the corresponding false evaluation of that condition. Following this idea, $\pi \le \pi'$ if

$$\exists\, 0 \le k \le n : \left(\forall\, 1 \le i \le k : g\_i = g'\_i\right) \wedge \left((n = k \wedge m \ge n) \vee (m > k \wedge n > k \wedge B\_P(g\_{k+1}) = T \wedge B\_P(g'\_{k+1}) = F)\right).$$

An execution tree is a tree containing all execution paths of a program with the previously defined ordering, where nodes are labelled with the assume operations.

Based on the above ordering, we now specify ranges, which describe the sets of consecutive program execution paths analysed by a ranged analysis and which are characterized by a left and a right path that limit the range. Hence, a range $[\pi, \pi']$ is the set $\{\pi_r \in paths(P) \mid \pi \le \pi_r \le \pi'\}$<sup>5</sup>. To easily describe ranges that are not bounded on the left or right, we use the special paths $\pi_\bot, \pi_\top \notin paths(P)$, which are smaller and greater than every path, i.e., $\forall \pi \in paths(P) : (\pi \le \pi_\top) \wedge (\pi_\top \not\le \pi) \wedge (\pi_\bot \le \pi) \wedge (\pi \not\le \pi_\bot)$. Consequently, $[\pi_\bot, \pi_\top] = paths(P)$.
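The ordering and range membership can be sketched in Python by abstracting a path to the tuple of branch directions it takes; this simplifies the formal definition, which compares edges, and all names are our own:

```python
def path_le(p1, p2):
    """p1 <= p2, where paths are tuples of branch decisions 'T'/'F' taken
    in order: 'T' is smaller than 'F' at the first difference, and a
    prefix is smaller than (or equal to) its extensions."""
    for a, b in zip(p1, p2):
        if a != b:
            return a == "T" and b == "F"
    return len(p1) <= len(p2)

def in_range(p, lo=None, hi=None):
    """Membership in [lo, hi]; lo=None encodes pi_bot, hi=None pi_top."""
    return (lo is None or path_le(lo, p)) and (hi is None or path_le(p, hi))
```

With the decision sequences of the example paths from Sec. 2.2 ($\pi_{\tau_1}$ takes T, F, T and $\pi_{\tau_2}$ takes F, T), `path_le` orders $\pi_{\tau_1}$ before $\pi_{\tau_2}$, as expected.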

As the program is assumed to be deterministic except for the input, a test case $\tau : V \to \mathbb{Z}$, which maps each input variable to a concrete value, describes exactly a single path $\pi$<sup>6</sup>. We say that $\tau$ induces $\pi$ and write this path as $\pi_\tau$. Consequently, we can define a range by two induced paths, i.e., as $[\pi_{\tau_1}, \pi_{\tau_2}]$ for test cases $\tau_1$ and $\tau_2$. For the example program from Fig. 1, two example test cases are $\tau_1 = \{x : 0, y : 2, z : 1\}$ and $\tau_2 = \{x : 1, y : 0, z : 2\}$. Two such induced

<sup>5</sup> In [86], the range is formalized as $[\pi, \pi')$, but their implementation works on $[\pi, \pi']$.

<sup>6</sup> More concretely, test input τ describes a single maximal path and all its prefixes.

paths are $\pi_{\tau_1} = (\ell_0, c_1) \xrightarrow{x<y} (\ell_1, c_1) \xrightarrow{!(y<z)} (\ell_4, c_1) \xrightarrow{x<z} (\ell_7, c_1) \xrightarrow{ret\; z} (\ell_{11}, c_1)$, where $c_1 = [x \mapsto 0, y \mapsto 2, z \mapsto 1]$, and $\pi_{\tau_2} = (\ell_0, c_2) \xrightarrow{!(x<y)} (\ell_2, c_2) \xrightarrow{x<z} (\ell_5, c_2) \xrightarrow{ret\; x} (\ell_{11}, c_2)$, where $c_2 = [x \mapsto 1, y \mapsto 0, z \mapsto 2]$.

### 2.3 Configurable Program Analysis

We will realize our ranged analysis using the configurable program analysis (CPA) framework [17]. This framework allows one to define customized, abstract-interpretation-based analyses, i.e., it allows a selection of the abstract domain as well as a configuration of the exploration. For the latter, one defines when and how to combine information and when to stop the exploration. Formally, a CPA $\mathbb{A} = (D, \leadsto, \mathsf{merge}, \mathsf{stop})$ consists of

– the abstract domain $D = (Loc \times C, (E, \top, \sqsubseteq, \sqcup), \llbracket\cdot\rrbracket)$, which is composed of the set $Loc \times C$ of program states, a join semi-lattice on the abstract states $E$, as well as a concretization function $\llbracket\cdot\rrbracket$, which fulfils that

$$\llbracket \top \rrbracket = Loc \times C \text{ and } \forall e, e' \in E: \llbracket e \rrbracket \cup \llbracket e' \rrbracket \subseteq \llbracket e \sqcup e' \rrbracket,$$

– the transfer relation $\leadsto\, \subseteq E \times G \times E$ defining the abstract semantics that safely overapproximates the program semantics, i.e.,

$$\forall e \in E,\; g \in G:\; \{s' \mid \exists \text{ valid execution step } s \xrightarrow{g} s' \text{ with } s \in \llbracket e \rrbracket\} \subseteq \bigcup\_{(e,g,e') \in \leadsto} \llbracket e' \rrbracket,$$


– the merge operator $\mathsf{merge} : E \times E \to E$, which combines the information of two abstract states and must not lose the information of its second parameter, i.e., $\forall e, e' \in E : e' \sqsubseteq \mathsf{merge}(e, e')$, and

– the termination check $\mathsf{stop} : E \times 2^E \to \mathbb{B}$, which may only stop the exploration of an abstract state that is already covered, i.e.,

$$\forall e \in E, E\_{\text{sub}} \subseteq E: \mathsf{stop}(e, E\_{\text{sub}}) \implies \llbracket e \rrbracket \subseteq \bigcup\_{e' \in E\_{\text{sub}}} \llbracket e' \rrbracket.$$

To run the configured analysis, one executes a meta reachability analysis, the so-called CPA algorithm, which is configured by the CPA and provided with an initial value $e_{init} \in E$ from which the analysis starts. For details on the CPA algorithm, we refer the reader to [17].

As part of our ranged analysis, we use the abstract domain and transfer relation of a CPA $\mathbb{V}$ for value analysis [9] (also known as constant propagation or explicit analysis). An abstract state $v$ of the value analysis ignores program locations and maps each variable to either a concrete value of its domain or $\top$, which represents any value. The partial order $\sqsubseteq_\mathbb{V}$ and the join operator $\sqcup_\mathbb{V}$ are defined variable-wise while ensuring that $v \sqsubseteq_\mathbb{V} v' \Leftrightarrow \forall x \in V : v(x) = v'(x) \vee v'(x) = \top$<sup>7</sup> and $(v \sqcup_\mathbb{V} v')(x) = v(x)$ if $v(x) = v'(x)$ and otherwise $(v \sqcup_\mathbb{V} v')(x) = \top$. The concretization of abstract state $v$ contains

<sup>7</sup> Consequently, $\forall x \in V : \top_\mathbb{V}(x) = \top$.

Fig. 2: Composition of three ranged analyses (in orange)

all concrete states that agree on the concrete variable values, i.e., $\llbracket v \rrbracket_\mathbb{V} := \{(\ell, c) \in Loc \times C \mid \forall x \in V : v(x) = \top \vee v(x) = c(x)\}$. If the values of all relevant variables are known, the transfer relation $\leadsto_\mathbb{V}$ behaves like the program semantics. Otherwise, it may overapproximate the executability of a CFA edge and may assign the value $\top$ if a concrete value cannot be determined.
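A minimal Python sketch of the value-analysis lattice operations, assuming for simplicity that both abstract states range over the same set of variables (`TOP` stands for $\top$; the names are ours):

```python
TOP = object()  # the abstract value "any value"

def leq(v1, v2):
    """v1 is at most as abstract as v2: v2 is TOP or agrees, variable-wise."""
    return all(v2[x] is TOP or v1[x] == v2[x] for x in v2)

def join(v1, v2):
    """Variable-wise join: keep a concrete value only where both agree."""
    return {x: v1[x] if v1[x] == v2[x] else TOP for x in v1}
```

For instance, joining states that agree on `x` but disagree on `y` keeps the value of `x` and abstracts `y` to `TOP`.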

To easily build ranged analysis instances for various program analyses, we modularize our ranged analysis into a range reduction and a program analysis. Technically, we compose a ranged analysis from different CPAs using the concept of a composite CPA [17]. We demonstrate the composition for two CPAs; the composition of more than two CPAs works analogously or can be achieved by recursively composing two (composite) CPAs. A composite CPA $\mathbb{A}_\times = (D_\times, \leadsto_\times, \mathsf{merge}_\times, \mathsf{stop}_\times)$ of CPA $\mathbb{A}_1 = ((Loc \times C, (E_1, \top_1, \sqsubseteq_1, \sqcup_1), \llbracket\cdot\rrbracket_1), \leadsto_1, \mathsf{merge}_1, \mathsf{stop}_1)$ and CPA $\mathbb{A}_2 = ((Loc \times C, (E_2, \top_2, \sqsubseteq_2, \sqcup_2), \llbracket\cdot\rrbracket_2), \leadsto_2, \mathsf{merge}_2, \mathsf{stop}_2)$ considers the product domain $D_\times = (Loc \times C, (E_1 \times E_2, (\top_1, \top_2), \sqsubseteq_\times, \sqcup_\times), \llbracket\cdot\rrbracket_\times)$ that defines the operators elementwise, i.e., $(e_1, e_2) \sqsubseteq_\times (e'_1, e'_2)$ if $e_1 \sqsubseteq_1 e'_1$ and $e_2 \sqsubseteq_2 e'_2$; $(e_1, e_2) \sqcup_\times (e'_1, e'_2) = (e_1 \sqcup_1 e'_1, e_2 \sqcup_2 e'_2)$; and $\llbracket(e_1, e_2)\rrbracket_\times = \llbracket e_1 \rrbracket_1 \cap \llbracket e_2 \rrbracket_2$. The transfer relation $\leadsto_\times$ may be the product transfer relation or may strengthen it using knowledge about the other abstract successor. In contrast, $\mathsf{merge}_\times$ and $\mathsf{stop}_\times$ cannot be derived and must always be defined.

# 3 Composition of Ranged Analyses

In this section, we introduce the composition of ranged analyses as a generalization of ranged symbolic execution to arbitrary program analyses. The overall goal is to split the program paths into multiple disjoint ranges, each of which is analysed by a (different) program analysis. Therein, the task of a program analysis is to verify whether a program fulfils a given specification. Specifications are often given in the form of error locations, so that the task is to prove the unreachability of the error locations. The result of a verification task contains a verdict and potentially an additional witness (a justification or a concrete path violating the specification [14]). The verdict indicates whether the program fulfils the specification (verdict "true"), violates it (verdict "false"), or whether the analysis did not compute a result (verdict "unknown").

To ensure that an arbitrary program analysis operates on paths within a given range only, we employ ranged analysis. A ranged analysis is realized as

Fig. 3: Application of range reduction on the running example of Fig. 1

a composition of an arbitrary program analysis (a CPA) and a range reduction (also given as a CPA below) ensuring that path exploration stays within the range. A composition of ranged analyses is then obtained by (1) splitting the program into ranges, (2) running several ranged analyses in parallel, and (3) aggregating the analysis results at the end (see Fig. 2). Splitting is described in Sec. 4. For aggregation, we simply return the verdict "false" whenever one analysis returns "false"; we return "unknown" whenever no analysis returns "false" and one analysis returns "unknown" or aborts; otherwise we return "true". We do not support the aggregation of witnesses yet (but this could be realized similarly to [70]).

### 3.1 Ranged Analysis

Next, we define ranged analysis as a CPA composition of the target program analysis and the novel range reduction. The range reduction decides whether a path is included in a range $[\pi_{\tau_1}, \pi_{\tau_2}]$ and limits path exploration to this range. We decompose the range reduction for $[\pi_{\tau_1}, \pi_{\tau_2}]$ into a composition of two specialized range reductions $\mathbb{R}_{[\pi_{\tau_1}, \pi_\top]}$ and $\mathbb{R}_{[\pi_\bot, \pi_{\tau_2}]}$, which decide whether a path is in the range $[\pi_{\tau_1}, \pi_\top]$ and $[\pi_\bot, \pi_{\tau_2}]$, respectively. Since $[\pi_{\tau_1}, \pi_{\tau_2}] = [\pi_{\tau_1}, \pi_\top] \cap [\pi_\bot, \pi_{\tau_2}]$ and the composition stops the exploration of a path if one analysis returns $\bot$, the composite analysis $\mathbb{R}_{[\pi_{\tau_1}, \pi_{\tau_2}]} = \mathbb{R}_{[\pi_\bot, \pi_{\tau_2}]} \times \mathbb{R}_{[\pi_{\tau_1}, \pi_\top]}$ only explores paths that are included in both ranges (which are exactly the paths in $[\pi_{\tau_1}, \pi_{\tau_2}]$). Figure 3 depicts the application of range reduction to the example from Fig. 1: the range reduction $\mathbb{R}_{[\pi_\bot, \pi_{\tau_2}]}$ is depicted in Fig. 3a, $\mathbb{R}_{[\pi_{\tau_1}, \pi_\top]}$ in Fig. 3b, and the composition of both range reductions in Fig. 3c. Finally, the ranged analysis of an arbitrary program analysis $\mathbb{A}$ in a given range $[\pi_{\tau_1}, \pi_{\tau_2}]$ can be represented as the composition:

$$\mathbb{R}\_{[\pi\_{\tau\_1}, \pi\_\top]} \times \mathbb{R}\_{[\pi\_\bot, \pi\_{\tau\_2}]} \times \mathbb{A}$$

For $\mathbb{R}_{[\pi_{\tau_1}, \pi_{\tau_2}]}$, we define $\mathsf{merge}_\times$ component-wise from the individual merge operators and $\mathsf{stop}_\times$ as the conjunction of the individual stop operators. As soon as the range reduction decides that a path $\pi$ is not contained in the range $[\pi_{\tau_1}, \pi_{\tau_2}]$ and returns $\bot$, the exploration of the path stops for all analyses defined in the composition.

### 3.2 Range Reduction as CPA

Next, we define the range reduction $\mathbb{R}_{[\pi_{\tau_1}, \pi_\top]}$ ($\mathbb{R}_{[\pi_\bot, \pi_{\tau_2}]}$, respectively) as a CPA, which tracks whether a state is reached via a path in $[\pi_{\tau_1}, \pi_\top]$ ($[\pi_\bot, \pi_{\tau_2}]$).

Initialisation. To define the CPAs for $\mathbb{R}_{[\pi_{\tau_1}, \pi_\top]}$ and $\mathbb{R}_{[\pi_\bot, \pi_{\tau_2}]}$, we reuse components of the value analysis $\mathbb{V}$ (as described in Sec. 2.3). A value analysis explores at least all feasible paths of a program by tracking the values of program variables. If the program behaviour is fully determined (i.e., all (input) variables are set to constants), then only one feasible, maximal path exists, which is explored by the value analysis. We exploit this behaviour by initializing the analysis based on our test case $\tau$ (being a lower or upper bound of a range):

$$e\_{init}(x) = \begin{cases} \tau(x) & \text{if } x \in dom(\tau)\\ \top & \text{otherwise} \end{cases}$$

In this case, all variables which are typically undetermined<sup>8</sup> and depend on the program input now have a determined value, defined through the test case. As the behaviour of the program under the test case $\tau$ is now fully determined, the value analysis only explores a single path $\pi_\tau$, which corresponds to the execution trace of the program given the test case. Since we are interested in all paths of a range and not only a single path, we adapt the value analysis as follows:

Lower Bound CPA. For the CPA range reduction $\mathbb{R}_{[\pi_{\tau_1}, \pi_\top]}$ we borrow all components of the value analysis except for the transfer relation. The transfer relation $\leadsto_{\tau_1}$ is defined as follows:

$$(v, g, v') \in \leadsto\_{\tau\_1} \text{ iff } \begin{cases} v = \top \land v' = \top, \text{or} \\ v \neq \top \land v' = \top \land B\_P(g) = F \land (v, g, \bot) \in \leadsto\_{\mathsf{V}}, \text{or} \\ v \neq \top \land \left(v' \neq \bot \lor B\_P(g) \neq F\right) \land (v, g, v') \in \leadsto\_{\mathsf{V}} \end{cases}$$

Note that $\top$ represents the value analysis state in which no information on variables is stored, and $\bot$ represents an unreachable state in the value analysis, which stops the exploration of the path. Hence, the second case ensures that $\mathbb{R}_{[\pi_{\tau_1}, \pi_\top]}$ also visits the false-branch of a condition when the path induced by $\tau_1$ follows the true-branch. Note that in case $\leadsto_\mathbb{V}$ computes $\bot$ as a successor state for an assume operation $g$ with $B_P(g) = T$, the exploration of the path is stopped, as $\pi_{\tau_1}$ follows the false-branch (contained in the third case).

Upper Bound CPA. For the CPA range reduction $\mathbb{R}_{[\pi_\bot, \pi_{\tau_2}]}$ we again borrow all components of the value analysis except for the transfer relation. The transfer relation $\leadsto_{\tau_2}$ is defined as follows:

$$(v, g, v') \in \leadsto\_{\tau\_2} \text{ iff } \begin{cases} v = \top \land v' = \top \\ v \ne \top \land v' = \top \land B\_P(g) = T \land (v, g, \bot) \in \leadsto\_V \\ v \ne \top \land \left(v' \ne \bot \lor B\_P(g) \ne T\right) \land (v, g, v') \in \leadsto\_V \end{cases}$$

The second condition now ensures that $\mathbb{R}_{[\pi_\bot, \pi_{\tau_2}]}$ also visits the true-branch of a condition when $\pi_{\tau_2}$ follows the false-branch.
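The two transfer relations can be sketched in Python. We simplify by pre-evaluating each assume condition under the bounding test case (`cond_holds`) instead of tracking variable values, so all names and the state encoding are our own:

```python
TOP, BOTTOM = "TOP", "BOTTOM"  # stand-ins for the states ⊤ and ⊥

def value_step(v, cond_holds, direction):
    """Value-analysis successor on an assume edge: BOTTOM if the fully
    determined state contradicts the branch taken ('T' or 'F')."""
    return v if cond_holds == (direction == "T") else BOTTOM

def lower_step(v, cond_holds, direction):
    """Transfer of R_[pi_tau1, pi_top]: once exploration branches to the
    right (F) of the bounding path, every continuation is in range (TOP)."""
    if v == TOP:
        return TOP
    succ = value_step(v, cond_holds, direction)
    if succ == BOTTOM and direction == "F":
        return TOP   # bounding path takes T, explored path takes F: in range
    return succ      # BOTTOM when the bounding path takes F but we take T

def upper_step(v, cond_holds, direction):
    """Transfer of R_[pi_bot, pi_tau2]: symmetric, TOP to the left (T)."""
    if v == TOP:
        return TOP
    succ = value_step(v, cond_holds, direction)
    if succ == BOTTOM and direction == "T":
        return TOP
    return succ
```

As long as exploration follows the bounding path, the tracked state is kept; diverging towards the inside of the range yields `TOP` (track nothing, explore everything), and diverging towards the outside yields `BOTTOM` (stop).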

<sup>8</sup> Assuming that randomness is controlled through an input and hence the program is deterministic.

### 3.3 Handling Underspecified Test Cases

So far, we have assumed that test cases are fully specified, i.e., contain values for all input variables, and that the behaviour of the program is deterministic, such that executing a test case $\tau$ follows a single (maximal) execution path $\pi_\tau$. However, in practice we observe that test cases can be underspecified, i.e., a test case $\tau$ does not provide concrete values for all input variables. We denote by $P_\tau$ the set of all paths that are then induced by $\tau$. In this case, we define:

$$[\pi\_\bot, P\_\tau] = \{\pi \mid \forall \pi' \in P\_\tau : \pi \le \pi'\} = \{\pi \mid \pi \le \min(P\_\tau)\}$$

and

$$[P\_\tau, \pi\_\top] = \{\pi \mid \exists \pi' \in P\_\tau : \pi' \le \pi\} = \{\pi \mid \min(P\_\tau) \le \pi\}$$

Interestingly enough, by defining $\pi_\tau = \min(P_\tau)$ for an underspecified test case $\tau$, we can handle the range as if $\tau$ were fully specified.

### 4 Splitting

A crucial part of the ranged analysis is the generation of ranges, i.e., the splitting of programs into parts that can be analysed in parallel. The splitting has to compute either two paths or two test cases, both defining one range. Ranged symbolic execution [86] employs a random strategy for range generation (together with an online work-stealing concept to balance work among different workers). For the work here, we have also implemented this random strategy, selecting random paths in the execution tree to make up ranges. In addition, we propose a novel strategy based on the number of loop unrollings. Both strategies are designed to work "on-the-fly", meaning that neither requires building the full execution tree upfront; instead, they only compute the paths or test cases that are used to fix a range. Next, we explain both strategies in more detail, especially how they are used to generate more than two ranges.

Bounding the Number of Loop Unrollings (Lb). Given a loop bound i ∈ N, the splitting computes the left-most path in the program that contains exactly i unrollings of the loop. If the program contains nested loops, each nested loop is unrolled i times in each iteration of the outer loop. For the computed path, we (1) build its path formula using the strongest post-condition operator [46], (2) query an SMT solver for satisfiability, and (3) in case the answer is SAT, use the evaluation of the input variables in the path formula as one test case. In case the path formula is unsatisfiable, we iteratively remove the last statement from the path until a satisfiable path formula is found. A test case τ determined in this way defines two ranges, namely $[\pi_\bot, \pi_\tau]$ and $[\pi_\tau, \pi^\top]$. In case the program is loop-free, the generation of a test case fails and we generate a single range $[\pi_\bot, \pi^\top]$. In the experiments, we used the loop bounds 3 (called Lb3) and 10 (called Lb10), with two ranges each. To compute more than two ranges, we use intervals of loop bounds.

Generating Ranges Randomly (Rdm). The second splitting strategy selects the desired number of paths randomly. At each assume edge in the program (either a loop head or an if statement), it follows either the true- or the false-branch with a probability of 50%, until it reaches a node in the CFA without successor. Again, we compute the path formula for that path and build a test case. This purely random approach is called Rdm.

Fig. 4: Construction of a ranged analysis from an off-the-shelf program analysis

Selecting the true- or the false-branch with the same probability may lead to fairly short paths with few loop iterations, as the execution tree of a program is often not balanced but rather grows to the left (true-branches). Thus, we use a second strategy based on random walks, which takes the true-branch with a probability of 90%. We call this strategy Rdm9.
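Both random splitters amount to a biased random walk over the CFA. The following sketch illustrates this (our own toy encoding: `cfa_succ` maps a node to its assume-edge successors, and the default seed is fixed only to make the sketch reproducible):

```python
import random

def random_split_path(cfa_succ, entry, p_true=0.9, rng=None):
    """Random-walk splitter: Rdm uses p_true=0.5, Rdm9 uses p_true=0.9.
    cfa_succ maps a CFA node to {} (no successor) or to a dict
    {True: node, False: node} for its assume edges."""
    rng = rng or random.Random(0)
    node, path = entry, []
    while cfa_succ(node):
        branch = rng.random() < p_true   # biased coin at each assume edge
        path.append(branch)
        node = cfa_succ(node)[branch]
    return path   # the branch decisions fixing the selected path

# toy CFA: loop head repeated at nodes 0..2, exit node 3
succ = lambda n: {} if n == 3 else {True: n + 1, False: 3}
print(random_split_path(succ, 0, p_true=0.9))  # tends to stay in the loop
print(random_split_path(succ, 0, p_true=0.5))  # tends to leave early
```

As in the Lb strategy, the selected path is then turned into a test case via its path formula.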

### 5 Implementation

To show the advantages of the composition of ranged analyses, especially the possibility of running conceptually different analyses on different ranges of a program, we realized the range reduction from Sec. 3.2 and the ranged analyses in the tool CPAchecker [21]. The realization of the range reduction follows our formalization, i.e., it reuses elements from the value analysis, which are already implemented within CPAchecker.

Due to the composite pattern, we can build a ranged analysis as a composition of the range reduction and any existing program analysis within CPAchecker with nearly no effort. We can also use other (non-CPA-based) off-the-shelf analyses by employing the construction depicted in Fig. 4: instead of running the analysis in parallel with the range reduction CPA, we build a sequential composition of the range reduction and the analysis itself. As off-the-shelf tools take programs as inputs, not ranges, we first construct a reduced program, which by construction only contains the paths within the given range. For this, we can use the existing residual program generation within CPAchecker [19].

The composition of ranged analyses from Sec. 3 is realized using the tool CoVeriTeam [20]. CoVeriTeam allows building parallel and sequential compositions of existing program analyses, like the ones of CPAchecker. We use CoVeriTeam for the orchestration of the composition of ranged analyses. The implementation follows the structure depicted in Fig. 2 and also contains the Aggregation component. It is configured with the program analyses $A_1, \dots, A_n$ and a splitting component. For splitting, we realized the splitters Lb3, Lb10, Rdm and Rdm9 in CPAchecker. Each splitter generates test cases in the standardized XML-based TEST-Comp test case format<sup>9</sup>. In case the splitter fails (e.g., Lb3 cannot compute a test case if the program does not contain a loop), our implementation executes the analysis $A_1$ on the range $[\pi_\bot, \pi^\top]$. For the evaluation, we used combinations of three existing program analyses within the ranged analysis, briefly introduced next.

Symbolic Execution. Symbolic execution [73] analyses program paths based on symbolic inputs. Here, states are pairs of a symbolic store, which describes variable values by formulae on the symbolic inputs, and a path condition, which tracks the executability of the path. Operations update the symbolic store and at branching points the path condition is extended by the symbolic evaluation of the branching condition. Furthermore, the exploration of a path is stopped when it reaches the program end or its path condition becomes unsatisfiable.
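A toy sketch of this state representation (our own illustration, not CPAchecker's code): symbolic expressions are plain strings, and no solver is attached, so satisfiability of the path condition is not checked here.

```python
def symbolic_step(state, op):
    """One step of symbolic execution on a state (store, path_condition).
    ('assign', x, f) updates the symbolic store; ('assume', c) extends
    the path condition with the branch condition evaluated symbolically
    over the current store."""
    store, pc = state
    if op[0] == "assign":
        _, var, f = op
        new_store = dict(store)
        new_store[var] = f(store)
        return (new_store, pc)
    _, cond = op
    return (store, pc + [cond(store)])

# y := x + 1; assume y > 0   (x is a symbolic input)
s0 = ({"x": "x"}, [])
s1 = symbolic_step(s0, ("assign", "y", lambda s: f"({s['x']} + 1)"))
s2 = symbolic_step(s1, ("assume", lambda s: f"{s['y']} > 0"))
print(s2)  # ({'x': 'x', 'y': '(x + 1)'}, ['(x + 1) > 0'])
```

In a real symbolic executor, each extension of the path condition would additionally be checked for satisfiability to prune infeasible paths.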

Predicate Analysis. We use CPAchecker's standard predicate analysis, which is configured to perform model checking and predicate abstraction with adjustable block encoding [22] such that it abstracts at loop heads only. The required set of predicates is determined by counterexample-guided abstraction refinement [35], lazy refinement [64], and interpolation [63].

Bounded Model Checking. We use iterative bounded model checking (BMC). Each iteration inspects the behaviour of the CFA unrolled up to loop bound k and increases the loop bound in case no property violation was detected. To inspect the behaviour, BMC first encodes the unrolled CFA and the property in a formula using the unified SMT-based approach for software verification [15]. Thereafter, it checks the satisfiability of the formula encoding to detect property violations.
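The iterative scheme can be sketched as follows. For illustration only, the SMT encoding of the unrolled CFA is replaced by explicit state enumeration (an assumption of this sketch; actual BMC checks a formula encoding instead):

```python
def iterative_bmc(inits, step, bad, k_max=32):
    """Iterative BMC sketch: inspect behaviour up to bound k and increase
    the bound while no violation is found.  step(s) returns the
    successor states of s; bad(s) flags a property violation."""
    frontier = list(inits)
    for k in range(k_max + 1):
        if any(bad(s) for s in frontier):
            return ("violation", k)   # violation reachable within k steps
        frontier = [s2 for s in frontier for s2 in step(s)]
        if not frontier:
            return ("safe", k)        # all executions end before k_max
    return ("unknown", k_max)         # bound exhausted, no verdict

# counter incremented by 1 or 2 per step; the property forbids values > 4
res = iterative_bmc([0], lambda s: [s + 1, s + 2] if s < 5 else [], lambda s: s > 4)
print(res)  # ('violation', 3)
```

Note that, as in real BMC, a "safe" verdict is only obtained if every execution terminates within the bound; otherwise the result remains inconclusive.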

For the evaluation, we built four different basic configurations and employed our different range splitters: Ra-2Se and Ra-3Se, which employ two and three instances of symbolic execution in parallel, respectively; Ra-2bmc, employing two instances of BMC; and Ra-Se-Pred, which uses symbolic execution for the range $[\pi_\bot, \pi_\tau]$ and predicate analysis on $[\pi_\tau, \pi^\top]$ for some computed test input τ.

### 6 Evaluation

Siddiqui and Khurshid concentrated their evaluation on the issue of scaling, i.e., showing that a certain speed-up can be achieved by ranged execution [86]. More specifically, they showed that ranged symbolic execution can speed up path exploration when employing ten workers operating on ranges in parallel. In contrast, our interest was not only in scaling, but also in the obtained verification results. In particular, we wanted to find out whether a ranged analysis can obtain more results for verification tasks than the analyses in isolation would achieve within the same resource limitations. Furthermore, our evaluation is

<sup>9</sup> https://gitlab.com/sosy-lab/test-comp/test-format/blob/testcomp22/doc/Format.md

different from [86] in that we limit the available CPU time, meaning that the default analysis and the composition of ranged analyses have the same resources, and in that we employ different analyses. Finally, we were interested in evaluating our novel splitting strategy, in particular in comparison to the existing random strategy. To this end, we studied the following research questions:


### 6.1 Evaluation Setup

All experiments were run on machines with an Intel Xeon E3-1230 v5 @ 3.40 GHz (8 cores), 33 GB of memory, and Ubuntu 20.04 LTS with Linux kernel 5.4.0. We use BenchExec [23] for the execution of our experiments to increase the reproducibility of the results. In a verification run, a tool configuration is given a task (a program plus specification) and either computes a proof (if the program fulfils the specification) or raises an alarm (if the specification is violated by the program). We limit each verification run to 15 GB of memory, 4 CPU cores, and 15 min of CPU time, yielding a setup comparable to the one used in SV-Comp. The evaluation is conducted on a subset of the SV-Benchmarks used in SV-Comp, and all experiments were conducted once. It contains in total 5 400 C-tasks from all sub-categories of the SV-Comp category reach-safety [36]. The specification for this category, and hence for these tasks, states that all calls to the function reach_error are unreachable. Each task comes with a ground truth stating whether the task fulfils the specification (3 194 tasks) or not (2 206 tasks). All data collected is available in our supplementary artefact [60].

### 6.2 RQ 1: Composition of Ranged Analyses for Symbolic Execution

Evaluation Plan. To analyse the performance of symbolic execution in a composition of ranged analyses, we compare the effectiveness (number of tasks solved) and efficiency (time taken to solve a task) of compositions of ranged analyses with two and three ranged analyses, each using symbolic execution with one of the four splitters from Sec. 5, against standalone symbolic execution. For efficiency, we compare the consumed CPU time as well as the (real) time taken overall to solve the task (called wall time). The CPU time is always limited for the full configuration, such that an instance combining two ranged analyses in parallel also has only 900 s of CPU time available, hence at most 450 s per ranged analysis. To achieve a fair comparison, we also executed symbolic execution in CoVeriTeam, where we built a simple configuration that directly calls CPAchecker running its symbolic execution.


Table 1: Number of correct and incorrect verdicts reported by SymbExec and compositions of ranged analyses with symbolic executions using different splitters

Effectiveness. Table 1 compares the verdicts of symbolic execution (SymbExec) and the configurations using a composition of ranged analyses with one range (and thus two analyses in parallel, called Ra-2Se) or with two ranges (and three analyses, called Ra-3Se). The table shows the number of overall correct verdicts reported (divided into the number of correct proofs and correct alarms), the number of correct verdicts additionally reported compared to SymbExec, as well as the number of incorrect proofs and alarms reported. First of all, we observe that all configurations using a composition of ranged analyses compute more correct verdicts than SymbExec alone. We see the largest increase for Ra-2Se-Lb3, where 116 tasks are additionally solved. This increase comes nearly exclusively from the fact that Ra-2Se-Lb3 computes more correct alarms. The number of reported proofs does not change significantly, as SymbExec and all configurations of the composition of ranged analyses have to check the same number of paths leading to a property violation (namely all) for infeasibility. Thus, all need to do "the same amount of work" to compute a proof. As the available CPU time is identical for both, the ranged analyses do not compute additional proofs by sharing work. In contrast, for computing an alarm, finding a single path that violates the specification suffices. Thus, using two symbolic execution analyses in parallel working on different parts of the program increases the chance of finding such a violating path. All configurations employing the composition of ranged analyses compute a few more false alarms. For these tasks, SymbExec runs into a timeout and would also compute a false alarm if its time limit were increased.

For configurations using three symbolic executions in parallel, we used three splitters: Ra-3Se-Lb, which uses both loop-bound splitters in parallel, i.e., we have the ranges with fewer than three loop unrollings, three to ten loop unrollings, and more than ten, as well as Ra-3Se-Rdm and Ra-3Se-Rdm9, which both employ the random splitting to generate two ranges. Again, all configurations can compute more correct alarms compared to SymbExec, even more than Ra-2Se-Lb3. Again, splitting the state space into even more parts that are analysed in parallel increases the chance of finding an alarm.

Fig. 5: Scatter plot comparing SymbExec and Ra-2Se-Lb3

Fig. 6: Median factor of time increase for different configurations of Ra-2Se

Finally, when comparing the effectiveness of the different strategies employed to generate ranges, we observe that splitting the program using our novel component Lb3 is more effective than random splitting, both when using two and when using three symbolic execution analyses in parallel.

Efficiency. For comparing the efficiency of compositions of ranged analyses, we compare the CPU time and the wall time taken to compute a correct solution by SymbExec and several configurations of ranged analyses. We excluded all tasks where the generation of the ranges fails, as SymbExec and the composition of ranged analyses behave identically in these cases. In general, all configurations consume overall approximately as much CPU time as SymbExec to solve all tasks and are even faster w.r.t. wall time. The scatter plot in Fig. 5 visualizes, on a log scale, the CPU time consumed to compute a result by SymbExec (on the x-axis) and by Ra-2Se-Lb3 (on the y-axis) for tasks solved correctly by both analyses. It indicates that for tasks solved quickly, Ra-2Se-Lb3 requires more time than SymbExec, as the points mostly lie above the diagonal, and that the difference gets smaller the longer the analyses run.

We present a more detailed analysis of the efficiency in Fig. 6a and 6b. Each of the bar plots represents the median factor of the increase in the run time for tasks that are solved by SymbExec within the time interval given on the x-axis. If, for example, SymbExec solves a task in five CPU seconds and Ra-2Se-Lb3 in six CPU seconds, the factor is 1.2; if SymbExec takes five CPU seconds and Ra-2Se-Lb3 only three, the factor is 0.6. The width of the bars corresponds to the number of tasks within the interval. Figure 6a visualizes the comparison of the CPU time for Ra-2Se-Lb3 and SymbExec. For Ra-2Se-Lb3, the median and average increase is 1.6 for all tasks. Taking a closer look, in the median it takes twice as long to solve tasks which are solved by SymbExec within at most ten CPU seconds. Generating the ranges is done for the vast majority of all tasks within a few seconds. For tasks that can be solved in fewer than ten CPU seconds, the nearly constant overhead for generating the ranges that is present in each run of Ra-2Se-Lb3 has a large impact on both


Table 2: Number of correct and incorrect verdicts reported by compositions of bounded model checking (upper half) and combinations of symbolic execution and predicate analysis (lower half) using different splitters

CPU and wall time taken. Most importantly, the impact gets smaller the longer the analyses need to compute the result (the factor is constantly decreasing). For tasks that are solved by SymbExec in more than 50 CPU seconds, Ra-2Se-Lb3 is as fast as SymbExec; for tasks solved in more than 100 CPU seconds, it is 20% faster. As stated above, the CPU time consumed to compute a proof is not affected by parallelization. Thus, when only looking at the time taken to compute a proof, Ra-2Se-Lb3 takes as long as SymbExec after 50 CPU seconds. In contrast, Ra-2Se-Lb3 is faster at finding alarms in that interval. A more detailed analysis can be found in the artefact [60].

When comparing the wall time in Fig. 6b, the positive effect of the parallelization employed in all configurations of a composition of ranged analyses becomes visible. Ra-2Se-Lb3 is faster than SymbExec when SymbExec takes more than 20 seconds in real time to solve the task. To emphasize the effect of the parallelization, we also ran Ra-2Se-Lb3 with pre-computed ranges. Then, Ra-2Se-Lb3 takes only 1.1 times the wall time in the median compared to SymbExec, and is equally fast or faster for all tasks solved in more than ten seconds.

The use of compositions of ranged analyses for symbolic execution increases its effectiveness in finding violations of the specification. Moreover, the overall real time consumed to compute the result is reduced for large or complex tasks due to the parallelization employed. We have hence reproduced the findings from [86] in a different setting.

### 6.3 RQ 2: Composition of Ranged Analyses for Other Analyses

Evaluation Plan. To investigate whether other analysis combinations benefit from a composition of ranged analyses, we evaluated two combinations: The first

(a) For Ra-2bmc-Rdm9 and wall time (b) For Ra-Se-Pred-Lb3 and wall time

Fig. 7: Median factor of time increase for different compositions of ranged analyses

uses two instances of BMC (Ra-2bmc), the second one uses symbolic execution on the range $[\pi_\bot, \pi_\tau]$ and predicate analysis on the range $[\pi_\tau, \pi^\top]$ (Ra-Se-Pred). We are again interested in effectiveness and efficiency.

Results for BMC. The upper part of Tab. 2 contains the results for a composition of ranged analyses using two instances of BMC. In contrast to Ra-2Se, Ra-2bmc does not increase the number of overall correct verdicts compared to Bmc. While Ra-2bmc-Rdm9 computes 48 correct verdicts that are not computed by Bmc, it also fails to compute the correct verdict in 77 cases solved by Bmc. Both observations can mainly be explained by the fact that one analysis computes a result for a task where the other runs into a timeout. Again, we observe that the composition of ranged analyses computes additional alarms (here 36), as both ranged analyses search in different parts of the program.

When comparing the efficiency, we notice that the CPU time consumed to compute a result for Ra-2bmc-Rdm9 (and all other instances) is higher than for Bmc. On average, the increase is 2.6 and the median is 2.5, whereas the median increase for tasks solved in more than 100 CPU seconds by Bmc is 1.1. For the wall time, for which we depict the increases in Fig. 7a, the median overall increase is 1.9. This high overall increase is caused by the fact that Bmc can solve nearly 65% of all tasks within ten seconds of wall time. Thus, the overhead of computing the splitting has a big impact on the factor. For more complex or larger instances, where Bmc uses more time, the wall time of Ra-2bmc-Rdm9 is comparable; for instances taking more than 100 seconds, both take approximately the same time.

Results for Predicate Analysis and Symbolic Execution. Table 2 also contains the results for the compositions of ranged analyses using predicate analysis and symbolic execution in combination. Here, the column "add." contains the tasks that are neither solved by Pred nor SymbExec. Both default analyses used in this setting have different strengths, as Pred solves 1 517 tasks not solved by SymbExec, and SymbExec 649 not solved by Pred. 737 tasks are solved by both analyses.

The most successful configuration of the composition of ranged analyses again uses Lb3 for generating the ranges. In comparison to SymbExec and Pred, Ra-Se-Pred-Lb3 computes 635 more overall correct verdicts than SymbExec, but 233 fewer than Pred. It solves 430 tasks not solved by Pred and 918 tasks not solved by SymbExec. Most importantly, it computes 36 correct proofs and alarms that are found by neither Pred nor SymbExec. That the composition of ranged analyses can solve tasks that are not solvable by one or both instances lies in the fact that each analysis works only on a part of the program, making the verification problem easier. Unfortunately, the remaining part is sometimes still too complex to be verified by the analysis within the given time limit. Then, Ra-Se-Pred-Lb3 cannot compute a final result.

When evaluating the efficiency of Ra-Se-Pred-Lb3, we need to compare it to both Pred and SymbExec. Figure 7b compares the median factor of the wall time increase relative to Pred and SymbExec. For both, we observe that the median increase factor of the wall time is high (2.1 for Pred and 1.6 for SymbExec) for tasks that are solved quickly (within ten seconds), but decreases for more complex tasks. For tasks that are solved with a wall time greater than 100 s, Ra-Se-Pred-Lb3 takes approximately the same time as Pred and is 10% faster than SymbExec. Note that Fig. 7b does not include the situation that Pred or SymbExec does not compute a solution but Ra-Se-Pred-Lb3 does. For the former research questions, these cases happen rarely; for Ra-Se-Pred-Lb3 and SymbExec this occurs for 918 tasks. Ra-Se-Pred-Lb3 needs in the median 15 seconds of wall time to compute a solution when Pred runs into a timeout, and 52 seconds when SymbExec does; both would lead to an increase factor smaller than 0.1.

In summary, Bmc can partially benefit from using a composition of ranged analyses, although the effect is not as pronounced as for symbolic execution. The use of predicate analysis and symbolic execution within a composition of ranged analyses drastically increases the performance of the weaker-performing analysis SymbExec, but slightly decreases the performance of the better-performing predicate analysis. Again, Lb3 is a good choice for splitting.

### 7 Related Work

Numerous approaches combine different verification techniques. Selective combinations [6,40,45,51,72,83,92] consider certain features of a task to choose the best approach for that task. Nesting approaches [3,4,25,26,30,32,49,82,84] use one or more approaches as components in a main approach. Interleaved approaches [1,2,5,10,42,50,55,58,62,68,75,78,90,97] alternate between different approaches that may or may not exchange information. Testification approaches [28,29,39,43,52,74,81] often sequentially combine a verification and a validation approach and prioritize or only report confirmed proofs and alarms. Sequential portfolio approaches [44,61] run distinct, independent analyses in sequence while parallel portfolio approaches [91,12,57,65,66,96] execute various, independent analyses in parallel. Parallel white-box combinations [7,9,37,38,54,56,59,79] run different approaches in parallel, which exchange information for the purpose of collaboration. Next, we discuss cooperation approaches that split the search space as we do.

A common strategy for dividing the search space in sequential or interleaved combinations is to restrict the subsequent verifiers to the yet uncovered search space, e.g., not yet covered test goals [12], open proof obligations [67], or yet unexplored program paths [8,10,19,31,33,41,42,47,53,71]. Some parallel combinations like CoDiDroid [80], distributed assertion checking [93], or the compositional tester sketched in conditional testing [12] decompose the verification statically into separate subtasks. Furthermore, some techniques split the search space to run different instances of the same analysis in parallel on different parts of the program. For example, conditional static analysis [85] characterizes paths based on their executed program branches and uses sets of program branches to describe the split. Concurrent bounded model checking techniques [69,77] split paths based on their thread interleavings. Yan et al. [95] dynamically split the input space if the abstract interpreter returns an inconclusive result and analyse the input partitions separately with the abstract interpreter. To realize parallel test-case generation, Korat [76] considers different input ranges in distinct parallel instances. Parallel symbolic execution approaches [82,86,87,88,89,94] and ranged model checking [48] split execution paths, thereby often partitioning the execution tree. The sets of paths are characterized by input constraints [89], path prefixes [87,88], or ranges [82,86,94,48] and are either created statically from an initial shallow symbolic execution [87,88,89] or tests [82,86,94] or dynamically based on the already explored symbolic execution tree [27,34,82,86,98]. While we reuse the idea of splitting the program paths into ranges [82,86,94,48], we generalize the idea of ranged symbolic execution [82,86,94] to arbitrary analyses and in particular allow combining different analyses. Furthermore, we introduce a new static splitting strategy along loop bounds.

### 8 Conclusion

Ranged symbolic execution scales symbolic execution by having several analysis instances run on different ranges in parallel. In this paper, we have generalized this idea to arbitrary analyses by introducing and formalizing the notion of a composition of ranged analyses. We have moreover proposed and implemented a novel splitting component based on loop bounds. Our evaluation shows that a composition of ranged analyses can in particular increase the number of solved tasks. It furthermore demonstrates the superiority of the novel splitting strategy. As future work, we envision the incorporation of information sharing between the analyses running in parallel.

Data Availability Statement. All experimental data and our open source implementation are archived and available in our supplementary artefact [60].

### References


Proc. CAV. pp. 504–518. LNCS 4590, Springer (2007). https://doi.org/10.1007/978-3-540-73368-3_51


techniques. In: Proc. ICTSS. pp. 54–70. LNCS 10533, Springer (2017). https://doi.org/10.1007/978-3-319-67549-7_4


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Runtime Enforcement Using Knowledge Bases

Eduard Kamburjan¹ and Crystal Chang Din²

¹ University of Oslo, Oslo, Norway
eduard@ifi.uio.no
² University of Bergen, Bergen, Norway
crystal.din@uib.no

Abstract. Knowledge bases have been extensively used to represent and reason about static domain knowledge. In this work, we show how to enforce domain knowledge about dynamic processes to guide executions at runtime. To do so, we map the execution trace to a knowledge base and require that this mapped knowledge base is always consistent with the domain knowledge. This means that we treat the consistency with domain knowledge as an invariant of the execution trace. This way, the domain knowledge guides the execution by determining the next possible steps, i.e., by exploring which steps are possible and rejecting those resulting in an inconsistent knowledge base. Using this invariant directly at runtime can be computationally heavy, as it requires checking the consistency of a large logical theory. Thus, we provide a transformation that generates a system that performs the check only on the events observed so far, by evaluating a smaller formula. This transformation is transparent to domain users, who can interact with the transformed system in terms of the domain knowledge, e.g., to query computation results. Furthermore, we discuss different mapping strategies.

### 1 Introduction

Knowledge bases (KBs) are logic-based representations of both data and domain knowledge, for which there exists a rich toolset to query data and reason about data semantically, i.e., in terms of the domain knowledge. This enables domain users to interact with modern IT systems [39] without being exposed to implementation details, as well as to make their domain knowledge available for software applications. KBs are the foundation of many modern innovation drivers and key technologies: Applications range from Digital Twin engineering [31], over industry standards in robotics [23] to expert systems, e.g., in medicine [38].

The success story of KBs, however, is so far based on the use of domain knowledge about static data. The connection to transition systems and programs beyond Prolog-style logic programming has only begun to be explored. This is mainly triggered by tool support for developing applications that use KBs [7,13,28] in a type-safe way [29,32].

In this work, we investigate how one can use domain knowledge about dynamic processes and formalize knowledge about the order of computations to be performed. More concretely, we describe a runtime enforcement technique to use domain knowledge to guide the selection of rules in a transition system, for example to simulate behavior with respect to domain knowledge, a scenario that we use as a guiding example in this article, or to enforce compliance of business process models with respect to restrictions arising from the domain [41].

Approach. At the core, our approach considers the execution trace of a run, i.e., the sequence of rule applications, as a KB itself. As such, it can be combined with the KB that expresses the domain knowledge of dynamic processes (DKDP). The DKDP expresses knowledge about (partial) executions such that the execution trace must be consistent with it before and after every rule application. For example, in a simulation system for geology, the DKDP may express that a certain rock layer A is above a certain rock layer B and, thus, the event to deposit a layer must occur for B before it occurs for A. Consistency with the DKDP forms a domain invariant for the trace of a system, i.e., a trace property.

To trigger a transition rule, we use a hypothetical execution step: the execution trace is extended with a potential event, and the consistency of the extended trace with the DKDP is checked. However, using this consistency invariant directly at runtime can be computationally heavy, as it requires checking the consistency of a large logical theory. Thus, we give a transformation that removes the need for a hypothetical execution step and instead results in a transition system that evaluates a transformed condition on (1) the existing trace and (2) the parameters of the potential event. This condition no longer requires domain-specific reasoning, and the DKDP can then be used to guide any transition system, including languages based on structural operational semantics. For example, it is then possible to express the invariant checking as a guard for the rule that deposits layers (e.g., only deposit A if layer B has been deposited already).
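The hypothetical execution step can be illustrated with a toy sketch. Here the DL consistency check is replaced by a hand-written predicate encoding the geology rule; a real system would instead query a KB reasoner:

```python
def consistent(trace):
    """Toy stand-in for the consistency check against the DKDP: layer B
    must be deposited before layer A (a hand-coded rule, not a reasoner)."""
    deposited = set()
    for kind, layer in trace:
        if kind == "deposit" and layer == "A" and "B" not in deposited:
            return False
        if kind == "deposit":
            deposited.add(layer)
    return True

def try_step(trace, event):
    """Hypothetical execution step: extend the trace with a potential
    event and keep the extension only if the result stays consistent."""
    candidate = trace + [event]
    return candidate if consistent(candidate) else trace

t = []
t = try_step(t, ("deposit", "A"))   # rejected: B not deposited yet
t = try_step(t, ("deposit", "B"))   # accepted
t = try_step(t, ("deposit", "A"))   # accepted now
print(t)  # [('deposit', 'B'), ('deposit', 'A')]
```

The transformation described above replaces the repeated call to `consistent` on the extended trace by a guard evaluated on the existing trace and the event parameters alone.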

It is crucial that this system is usable for both the domain user (who possesses the domain knowledge) and the programmer (who has to program the interaction with the domain knowledge), a requirement explicitly stressed by Corea et al. [16] for the use of ontologies in business process models. We, thus, carefully designed our framework to increase its usability: First, the reasoning (in the geology example above, from spatial properties of layers to temporal properties of events) is performed completely in the domain and need not be handled by the transition system, i.e., the programmer does not have to perform reasoning over the KB in the program itself. Second, the DKDP is expressed over domain events, as the domain users do not have knowledge about implementation details, such as the state organization. Furthermore, the formalization of the DKDP should not be affected by the underlying implementation details, such that the DKDP can be reused. The DKDP can reuse the aforementioned industry standards and established ontologies, as well as modeling languages and techniques from ontology engineering [17], such as OWL [42], which are established for domain modeling and more suitable for this task than correctness-focused temporal logics such as LTL [35]: the domain user need not be an expert in programming or verification to contribute to the system.

The transformation that replaces the hypothetical execution step with a transition system evaluating a transformed condition is also transparent to the domain users. We say a transformed guarded rule is applicable if it would not violate consistency w.r.t. the DKDP. Lastly, we provide the domain users with the possibility to query the final result, i.e., the KB of the final execution trace, and to explore possible simulations using the defined DKDP. Note that the mapping from trace to KB need not be designed manually: various (semi-)automatic mapping design strategies are discussed in the paper.

Contributions and Structure. Our main contributions are (1) a system that enforces domain knowledge to guide a transition system at run time, and (2) a procedure that transforms such a transition system, which uses consistency with domain knowledge as an invariant, into a transition system using first-order guards over past events, in a transparent way. We give preliminaries in Sec. 2 and present our running example in Sec. 3. We formalize our approach in Sec. 4 and give the transformation in Sec. 5, before we discuss (semi-)automatically generated mappings in Sec. 6. We discuss the mappings in Sec. 7 and related work in Sec. 8. Lastly, Sec. 9 concludes.

# 2 Preliminaries

We give some technical preliminaries for knowledge bases as well as transition systems, as far as they are needed for our runtime enforcement technique.

Definition 1 (Domain Knowledge of Dynamic Processes). Domain knowledge of dynamic processes (DKDP) is the knowledge about events and changes.

Example 1 (DKDP in Geology). DKDP describes knowledge about some temporal properties in a domain. In geology, for example, this may be the knowledge that a deposition of some geological layers in the Cretaceous should happen after a deposition in the Jurassic, because the Cretaceous is after the Jurassic. This can be deduced from, e.g., fossils found in the layers.

A description logic (DL) is a decidable fragment of first-order logic with suitable expressive power for knowledge representation [3]. We do not commit to any specific DL here, but require that for the chosen DL it is decidable to check consistency of a KB, which we define next. A knowledge base is a collection of DL axioms, over individuals (corresponding to first-order logic constants), concepts, also called classes (corresponding to first-order logic unary predicates) and roles, also called properties (corresponding to first-order logic binary predicates).

Definition 2 (Knowledge Base). A knowledge base (KB) K = (R, T , A) is a triple of three sets of DL axioms, where the ABox A contains assertions over individuals, the TBox T contains axioms over concepts, and the RBox R contains axioms over roles. A KB is consistent if no contradiction follows from it.

KBs can be seen as first-order logic theories, so we refrain from introducing them fully formally and instead introduce them by examples throughout the article. The Manchester syntax [25] is used for DL formulas in examples to emphasize that they model knowledge, but we treat them as first-order logic formulas otherwise.

Example 2. Continuing Ex. 1, the knowledge that the Jurassic is before the Cretaceous is expressed by the following ABox axiom, where Jurassic and Cretaceous are individuals, while before is a role.

before(Jurassic, Cretaceous)

The following TBox axioms express that every layer with Stegosaurus fossils has been deposited during the Jurassic. The first two axioms define the concepts StegoLayer (the class of things having the value Stegosaurus as their contains role) and JurassicLayer (the class of things having the value Jurassic as their during role). The last axiom states that StegoLayer is a subclass of JurassicLayer.<sup>3</sup> The bold literals are keywords, the literals StegoLayer and JurassicLayer denote concepts/classes, the literals contains and during denote roles/properties, and the literals Stegosaurus and Jurassic denote individuals.

StegoLayer EquivalentTo contains value Stegosaurus
JurassicLayer EquivalentTo during value Jurassic
StegoLayer SubClassOf JurassicLayer

The following RBox axioms express two constraints: the first line states that both the below and before roles are asymmetric. The second line states that if a deposition is from an age before the age of another deposition, then it is below that deposition. Formally, the axiom expresses that the concatenation of the following three roles, (a) the during role, (b) the before role, and (c) the inverse of the during role, is a sub-property of the below role. I.e., given an individual a, every individual b reachable from a by following the chain during, before and the inverse of during is also reachable by just below.

Asy(below)   Asy(before)
during o before o inverse(during) SubPropertyOf below

Knowledge-based guiding can be applied to any transition system to leverage domain knowledge during execution. States are not the focus of our work, and neither is the exact form of the rules that specify the transition between states. For our purposes, it suffices to define states as terms, i.e., finite trees where each node is labeled with a name from a finite set of term symbols, and transition rules as transformations between schematic terms. State guards can be added but are omitted for brevity's sake.

Definition 3 (Terms and Substitutions). Let Σ_T be a finite set of term labels and Σ_V a disjoint set of term variables. A term t is a finite tree, where each inner node is a term label and each leaf is either a term label or a term variable. The set of term variables in a term t is denoted Σ(t). We denote the set of all terms with T. A substitution σ is a map from term variables to terms without term variables. The application of a substitution σ to a term t, with the usual semantics, is denoted tσ. In particular, if t contains no term variables, then tσ = t.

<sup>3</sup> The first-order equivalent is ∀x. contains(x, Stegosaurus) → during(x, Jurassic)
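Def. 3 can be sketched concretely. The following Python sketch is our own encoding (not from the paper): a term is either a variable, marked here by a leading `?`, or a pair of a label and a list of child terms.

```python
# Hedged sketch of Def. 3 (the tuple encoding and the "?" variable marker are
# our assumptions, not the paper's): a term is either a term variable or a
# pair (label, children).
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def term_vars(t):
    """Sigma(t): the set of term variables occurring in t."""
    if is_var(t):
        return {t}
    _, children = t
    return set().union(*(term_vars(c) for c in children)) if children else set()

def apply_subst(t, sigma):
    """t*sigma: replace variables by the (variable-free) terms sigma assigns."""
    if is_var(t):
        return sigma.get(t, t)
    label, children = t
    return (label, [apply_subst(c, sigma) for c in children])

# A schematic term with one variable, and a ground instance of it:
schematic = ("layer", ["?fossil"])
ground = apply_subst(schematic, {"?fossil": ("Stegosaurus", [])})
```

As Def. 3 requires, applying any substitution to a ground term leaves it unchanged.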

Rewrite rules map one term to another by unifying a subterm with the head term. The matched subterm is then rewritten by applying the substitution to the body term. Normally one would have additional conditions on the transition rules, but these are not necessary to present semantic guiding.

Definition 4 (Term Rewriting Systems). A transition rule in a term rewriting system has the form

$$t\_{\text{head}} \xrightarrow{\text{r}} t\_{\text{body}}$$

where r is the name of the rule, and t_head, t_body ∈ T are the head and body terms.

A rule matches on a term t with Σ(t) = ∅ if there is a subterm t_s of t such that t_head σ = t_s for a suitable substitution σ. A rule produces a term t′ by matching on subterm t_s with substitution σ and generating t′ by replacing t_s in t by t_body σ′, where σ′ is equal to σ for all v ∈ Σ(t_body) ∩ Σ(t_head) and maps v ∈ Σ(t_body) \ Σ(t_head) to fresh term symbols. For production, we write

$$t \xrightarrow{\mathfrak{r}, \sigma'} t'$$
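Matching and production can be sketched as follows. This is our own simplified encoding of Def. 4 (terms as `(label, children)` pairs, variables as strings starting with `?`); it rewrites at the leftmost-outermost matching subterm and omits the fresh-symbol generation for body-only variables.

```python
# Hedged sketch of Def. 4 (our encoding and simplification, not the paper's
# implementation): one-way matching binds head variables, and the matched
# subterm is replaced by the instantiated body.
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def apply_subst(t, sigma):
    if is_var(t):
        return sigma.get(t, t)
    label, children = t
    return (label, [apply_subst(c, sigma) for c in children])

def match(pattern, term, sigma=None):
    """One-way matching: find sigma with pattern*sigma == term, else None."""
    sigma = dict(sigma or {})
    if is_var(pattern):
        if pattern in sigma and sigma[pattern] != term:
            return None
        sigma[pattern] = term
        return sigma
    if pattern[0] != term[0] or len(pattern[1]) != len(term[1]):
        return None
    for p, t in zip(pattern[1], term[1]):
        sigma = match(p, t, sigma)
        if sigma is None:
            return None
    return sigma

def rewrite(term, head, body):
    """Apply the rule head -> body at the first matching subterm, or None."""
    sigma = match(head, term)
    if sigma is not None:
        return apply_subst(body, sigma)
    label, children = term
    for i, c in enumerate(children):
        new = rewrite(c, head, body)
        if new is not None:
            return (label, children[:i] + [new] + children[i + 1:])
    return None
```

For instance, a rule with head `("top", ["?x"])` and body `("top", [("layer", ["?x"])])` stacks a new layer on whatever the variable `?x` matched.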

# 3 A Scenario for Knowledge Based Guiding

To illustrate our approach, we continue with geology, namely with a simulator for deposition and erosion of geological layers. Such a simulator is used, e.g., for hydrocarbon exploration [20]. It contains domain knowledge about the type of fossils and the corresponding geological age, and connects spatial information about deposition layers with temporal information about their deposition. We started a formalization of the DKDP in Ex. 2 and expand it below.

The core challenge is that the simulator must make sure that it does not violate domain properties. This means that it cannot deposit a layer containing fossils from the Jurassic after depositing a layer containing fossils from the Cretaceous. This information is given by the domain users as an invariant, i.e., as knowledge that the execution must be consistent with at all times.

Programming with Knowledge Bases. Our model of computation is a set of rewrite rules on some transition structure. The sequence of rule applications, whose elements we call events, forms the trace. The DKDP constrains the extension of the trace. This realizes a clear separation of concerns between declarative data modeling and imperative programming with, in our case, transitions.

Example 3. Let us assume four rules: a rule deposit that deposits a layer without fossils, a rule depositStego that deposits a layer with Stegosaurus fossils, an analogous rule depositTRex that deposits a layer with Tyrannosaurus fossils, and a rule erode that removes the top layer of the deposition. One example reduction sequence, for some terms t_i and with substitutions omitted, is as follows:

$$t_0 \xrightarrow{\mathsf{depositStego}} t_1 \xrightarrow{\mathsf{erode}} t_2 \xrightarrow{\mathsf{depositTRex}} t_3$$

Fig. 1. Left: KB as generated. Right: Inferred KB to detect inconsistency.

which describes the application of rule depositStego on term t_0, followed by the application of erode on term t_1 and then depositTRex on term t_2.

In the domain KB, we add an axiom expressing that the geological layer containing Stegosaurus fossils is deposited during the Jurassic, and that the geological layer containing Tyrannosaurus fossils is deposited during the Cretaceous.

Consider that rule depositStego may trigger on term t_3.

$$\dots t_2 \xrightarrow{\mathsf{depositTRex}} t_3 \xrightarrow[?]{\mathsf{depositStego}}$$

This would violate the domain knowledge, as we could derive a situation where a layer with Tyrannosaurus fossils is below a layer with Stegosaurus fossils, implying that the Cretaceous is before the Jurassic. This contradiction is captured by the knowledge base in Fig. 1. The DKDP should prevent this rule application at t_3. To achieve this, i.e., to enforce domain knowledge at run time, we must connect the trace with the KB. Specifically, we represent the trace as a KB itself, i.e., instead of operating on a KB, we record the events and generate a KB from the trace using a mapping.

For example, consider the left KB in Fig. 1. The upper part is (a part of) our DKDP about geological ages, while the lower part is the KB mapped from the trace. Together they form a KB. In the knowledge base of this example, we add one layer that contains Stegosaurus fossils for each depositStego event, and analogously for depositTRex events. We also add the below relation between two layers if their events are ordered. So, if we executed depositStego after depositTRex, there would be two layers in the KB as shown in Fig. 1, with corresponding fossils, connected using the below relation. On the right, the KB is shown with the additional knowledge following from its axioms. In particular, we can deduce that layer2 must be below layer1 using the axioms from Sec. 2. This, in turn, makes the overall KB inconsistent, as below must be asymmetric.
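The inconsistency argument of Fig. 1 can be replayed in a small hand-coded sketch. The encoding below is ours, not the paper's implementation: deposit events are mapped to layer facts, the RBox role chain from Sec. 2 infers additional `below` facts, and the asymmetry of `below` serves as the consistency check.

```python
# Hedged sketch of the Fig. 1 reasoning (all names and the encoding are our
# assumptions): a trace of deposits is consistent iff no `below` fact occurs
# in both directions after applying the role-chain axiom.
FOSSIL_AGE = {"Stegosaurus": "Jurassic", "Tyrannosaurus": "Cretaceous"}
BEFORE = {("Jurassic", "Cretaceous")}  # ABox: before(Jurassic, Cretaceous)

def consistent(deposits):
    """deposits: fossil names of deposit events, in execution order."""
    layers = [(i, FOSSIL_AGE[f]) for i, f in enumerate(deposits) if f in FOSSIL_AGE]
    below = set()
    # mapping: an earlier deposit lies below a later one
    for a in range(len(layers)):
        for b in range(a + 1, len(layers)):
            below.add((layers[a][0], layers[b][0]))
    # role chain: during(x,s), before(s,t), during(y,t)  =>  below(x,y)
    for i, age_i in layers:
        for j, age_j in layers:
            if (age_i, age_j) in BEFORE:
                below.add((i, j))
    # Asy(below): no pair may occur in both directions
    return all((b, a) not in below for (a, b) in below)
```

Depositing a Stegosaurus layer after a Tyrannosaurus layer yields `below` facts in both directions and is rejected, while the reverse order is accepted.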

We stress that consistency of the execution with the DKDP is a trace property: it reasons about the events that happen, regardless of the current state. In our example, consider the situation where, after t_3, rule erode triggers again, and then we consider rule depositStego, i.e., the following continuation of the trace

$$\dots t_2 \xrightarrow{\mathsf{depositTRex}} t_3 \xrightarrow{\mathsf{erode}} t_4 \xrightarrow[?]{\mathsf{depositStego}} t_5$$

We still consider the layer with the Tyrannosaurus fossils in our KB, despite its erosion: firstly, because the layer may have had an effect on the execution before being removed, and, secondly, because its deposition also models implicit information. It expresses the current geological era of the system, which cannot be reverted: at t_3 the system is in the Cretaceous, and since depositStego models an action in the Jurassic, the trace would not represent a semantically sensible execution if the depositStego rule were executed.

Fig. 2 illustrates the runtime enforcement of domain knowledge on traces in a more general setting. The execution itself is a reduction sequence over some terms t, where each rule application emits some event ev, e.g., name of the applied rule and matched subterms. A mapping µ is used to generate a KB from the trace. The knowledge base then contains (a) the DKDP, pictured as the shaded box, (b) the mapping of the trace so far, pictured as the unshaded box with solid frame, and (c) the potential next event, pictured as the dashed box. Additionally, new connections may be inferred.

The mapping from a trace to a KB matches the system formalized by the domain knowledge to the system used for programming; it is the interface between the domain experts and the programmer. Indeed, the mapping allows the domain users to investigate program executions without being exposed to the implementation details. Given a fixed execution, the mapping can be applied to allow the domain users to query its results (in the form of the trace) using domain vocabulary.

From the program's point of view, it defines an invariant over the trace, which must always hold: consistency with the domain knowledge. While this saves the domain users from learning about the implementation, it poses two challenges to the programmer: first, the mapping must be developed in addition to the rules, and second, the invariant is not specific to the rules. The extended trace caused by the execution of one single event must be checked against the full DKDP, which is not specific to any transition event. Instead of this computationally costly operation, we provide an alternative. For example, to ensure consistency when executing the rule depositStego, it suffices to evaluate the following formula on the past trace tr to check that depositTRex has not been executed yet: $\forall i \leq |tr|.\ tr[i] \neq \mathsf{ev}(\mathsf{depositTRex})$. The condition of a rule is specific to the corresponding transition action, instead of being a general condition on all the rules.

After defining runtime enforcement of domain knowledge formally, we return to these challenges and (a) discuss different mapping strategies, especially the (semi-)automatic generation of mappings, and (b) give a system that, for a large class of mappings, also derives local conditions.

# 4 Knowledge Guided Transition Systems

We now introduce runtime enforcement using KBs. To this end, we define the mapping of traces to KBs formally and give the transition system that uses this lifting for consistency checking. First, we define the notion of traces.

Fig. 2. Runtime enforcement of knowledge bases on traces.

Definition 5 (Execution Traces). An event ev for a rule r and a substitution σ has the form ev(r, σ), which we write as ev(r, v_1 : t_1, . . . , v_n : t_n), where the v_i : t_i are the pairs in σ. To record the sequence of an execution, we use traces. A trace is a finite sequence of events, where each event records the applied rule and the corresponding substitutions, if there are any.

Example 4. The trace of the rule application in Ex. 3 is as follows, for suitable substitutions that all store the deposited or eroded layer in the variable v.

$$\langle \mathsf{ev}(\mathsf{depositStego}, v: \mathsf{layer}_0),\ \mathsf{ev}(\mathsf{erode}, v: \mathsf{layer}_0),\ \mathsf{ev}(\mathsf{depositTRex}, v: \mathsf{layer}_1)\rangle$$

To connect executions with knowledge bases, we define mappings that transform traces into knowledge bases, given a fixed vocabulary Σ.

Definition 6 (Mappings). A Σ-mapping µ is a function from traces to knowledge bases over vocabulary Σ.

The mapping is given by the user, who has to respect the signature of the KB formalizing the used domain knowledge. While we do not prescribe the structure of the mapping in general, we introduce the notion of a first-order matching mapping, which allows for optimization and automation.

Definition 7 (First-Order Matching Mapping). A first-order matching mapping µ is defined by a set $\{\varphi_1 \mapsto_{N_1} ax_1, \dots, \varphi_n \mapsto_{N_n} ax_n\}$, where each element has a first-order logic formula φ_i as its guard, a set of individuals N_i and some set ax_i of KB axioms as its body. We write ax_i(N) to emphasize that the set of individuals N occurs in ax_i.

The mapping is applied to a trace tr by adding all those bodies whose guard evaluates to true and replacing all members of N in ax_i by fresh individual names:

$$\mu(tr) = \left(\bigcup_{tr \models \varphi_i} ax_i(N)\right)[N\ \mathit{fresh}]$$

where A[N fresh] substitutes all individuals in N with fresh names in A.

Example 5. Consider the following first-order matching mapping µ, for some role/property P and individuals A, B and C. The function rule(ev) extracts the rule name from the given event ev.

$$\begin{aligned} \{\exists i. \text{ } \mathtt{rule}(tr[i]) \doteq \mathtt{r}\_{1} \mapsto\_{\varnothing} \mathsf{P}(\mathsf{A}, \mathsf{B}), \quad & \exists i. \text{ } \mathtt{rule}(tr[i]) \doteq \mathtt{r}\_{2} \mapsto\_{\varnothing} \mathsf{P}(\mathsf{B}, \mathsf{A}),\\ \exists i. \text{ } \mathtt{rule}(tr[i]) \doteq \mathtt{r}\_{3} \mapsto\_{\varnothing} \mathsf{P}(\mathsf{A}, \mathsf{C}), \quad & \exists i. \text{ } \mathtt{rule}(tr[i]) \doteq \mathtt{r}\_{4} \mapsto\_{\varnothing} \mathsf{P}(\mathsf{C}, \mathsf{A}) \} \end{aligned}$$

Its application to the trace $\langle \mathsf{ev}(\mathsf{r}_1), \mathsf{ev}(\mathsf{r}_1), \mathsf{ev}(\mathsf{r}_2)\rangle$ is the set {P(A, B), P(B, A)}.
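Ex. 5 can be sketched directly. In the following Python encoding (ours, not the paper's), a guard is a predicate over the trace and a body is a set of axiom tuples; since N = ∅ in Ex. 5, no fresh names need to be introduced.

```python
# Hedged sketch of Def. 7 / Ex. 5 (the encoding is our assumption): mu(tr) is
# the union of the bodies of all guards that hold on the trace.
def occurs(rule):
    # guard: exists i. rule(tr[i]) = rule
    return lambda tr: any(ev == rule for ev in tr)

MAPPING = [
    (occurs("r1"), {("P", "A", "B")}),
    (occurs("r2"), {("P", "B", "A")}),
    (occurs("r3"), {("P", "A", "C")}),
    (occurs("r4"), {("P", "C", "A")}),
]

def mu(trace):
    """Union of the bodies whose guards evaluate to true on the trace."""
    kb = set()
    for guard, axioms in MAPPING:
        if guard(trace):
            kb |= axioms
    return kb
```

Applying it to the trace of Ex. 5 reproduces the set {P(A, B), P(B, A)}.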

First-order matching mappings can also be applied to our running example.

Example 6. We continue with the trace from Ex. 4, extended with another event ev(depositStego, v : layer2). We check whether adding an event to the trace would result in a consistent KB by actually extending the trace for analysis. We call this a hypothetical execution step.

The following mapping, which must be provided by the user, adds the spatial information about layers w.r.t. the fossils found within. The first-order logic formula in the guard of the mapping expresses that an event of depositTRex is found before the event of depositStego in the trace. Note that the given set of axioms from the mapping faithfully describes the event structure of the trace, i.e., the mapping may produce axioms which cause inconsistency w.r.t. the domain knowledge: together with the DKDP, we can see that the trace is mapped to an inconsistent knowledge base by adding five axioms. Note that we do not generate one layer for each deposition event during simulation, but only two specific ones, Layer(l_1) and Layer(l_2) in this case, for the relevant information. One can extend the mapping rules for the different cases (for instance, depositStego before depositTRex, only depositTRex events, etc.), or use a different mapping mechanism, which we discuss further in Sec. 6.

$$\begin{aligned}
&\exists l_1, l_2.\ \exists i_1, i_2.\\
&\quad tr[i_1] \doteq \mathsf{ev}(\mathsf{depositTRex}, v: l_1) \land tr[i_2] \doteq \mathsf{ev}(\mathsf{depositStego}, v: l_2) \land i_1 < i_2\\
&\mapsto_{l_1, l_2}\\
&\quad \{\mathsf{Layer}(l_1), \mathsf{contains}(l_1, \mathsf{Tyrannosaurus}),\\
&\qquad \mathsf{Layer}(l_2), \mathsf{contains}(l_2, \mathsf{Stegosaurus}), \mathsf{below}(l_1, l_2)\}
\end{aligned}$$

We stress again that we are interested in trace properties, a layer may still have had effects on the state despite being completely removed at one point (by an erode event). Thus, we must consider the deposition event of a layer to check the trace against the domain knowledge.

The guided transition system extends a basic transition system with the mapping, by additionally ensuring that the trace after executing a rule would be mapped to a consistent knowledge base. This treats the domain knowledge as an invariant that is enforced, i.e., a transition is only allowed if it indeed preserves the invariant.

Definition 8 (Guided Transition System). Given a set of rules R, a mapping µ and a knowledge base K, the guided semantics is defined as a transition system between pairs of terms t and traces tr . For each rule r ∈ R, we have one guided rule (for consistency, cf. Def. 2):

$$(\mathsf{kb})\ \frac{t \xrightarrow{\mathsf{r},\sigma} t' \qquad ev = \mathsf{ev}(\mathsf{r},\sigma) \qquad \mu(tr \circ ev) \cup \mathcal{K} \text{ is consistent}}{(t, tr) \xrightarrow{\mathsf{r}} (t', tr \circ ev)}$$

The set of traces generated by a rewrite system R from a starting term t_0 is denoted H(R, µ, K, t_0). Execution always starts with the empty trace.
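The (kb) rule can be sketched as a single step function. The encoding below is ours; `apply_rule`, `mu` and `consistent_with_K` are hypothetical parameters standing in for the rewrite step, the mapping, and the DL reasoner, respectively.

```python
# Hedged sketch of rule (kb) of Def. 8 (our encoding, not the paper's
# implementation): a hypothetical execution step tentatively extends the
# trace and keeps the step only if the mapped KB stays consistent with K.
def guided_step(state, trace, rule, apply_rule, mu, consistent_with_K):
    event = ("ev", rule)
    if not consistent_with_K(mu(trace + [event])):
        return None  # rule not applicable: it would violate the DKDP invariant
    return apply_rule(state, rule), trace + [event]
```

With a toy consistency check that forbids a `bad` event, an allowed step returns the new state together with the extended trace, while a forbidden step returns `None`.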

# 5 Well-Formedness and Optimization

The transition rule in Def. 8 uses the knowledge base directly to check consistency, and while this enables integrating domain knowledge into the system directly, it also poses challenges from a practical point of view. First, the condition of the rule application is not specific to the change of the trace, and must check the consistency of the whole knowledge base, which can be computationally heavy. Second, the consistency check is performed at every step, for every potential rule application. Third, the trace must be mapped whenever it is extended, which means the same mapping computation that was performed in the previous step may be executed all over again.

To overcome these challenges, we provide a system that reduces consistency checking by using well-formedness guards, which only require evaluating an expression over the trace, without accessing the knowledge base. These guards are transparent to the domain users: the system behaves the same as with the consistency checks on the knowledge base. At its core, we use well-formedness predicates, which characterize the relation of domain knowledge and mappings.

Definition 9 (Well-Formedness). A first-order predicate wf of a trace tr is a well-formedness predicate for some mapping µ and some knowledge base K, if the following holds:

$$\forall tr.\ wf(tr) \iff \mu(tr) \cup \mathcal{K}\text{ is consistent}$$

Using this definition we can slightly rewrite the rule of Def. 8: for every starting term t_0, the set of generated traces is the same if the rule of Def. 8 is replaced by the following one

$$(\mathsf{wf})\ \frac{t \xrightarrow{\mathsf{r},\sigma} t' \qquad ev = \mathsf{ev}(\mathsf{r},\sigma) \qquad wf(tr \circ ev)}{(t, tr) \xrightarrow{\mathsf{r}} (t', tr \circ ev)}$$

For first-order matching mappings, we can generate the well-formedness predicate by testing all possible extensions of the knowledge base upfront and defining the guards of those sets that are causing inconsistency as non-well-formed.

Theorem 1. Let µ be a first-order matching mapping for some knowledge base K. Let Ax = {ax_1, . . . , ax_n} be the set of all bodies in µ. Let Incons be the set of all subsets of Ax such that for each A ∈ Incons, $\bigcup_{a \in A} a \cup \mathcal{K}$ is inconsistent. Let guard_A be the set of guards corresponding to each body in A. The following predicate wf_µ is a well-formedness predicate for µ and K.

$$wf\_{\mu} = \neg \bigvee\_{A \in \mathtt{Incons}} \bigwedge\_{\varphi \in \mathtt{guard}\_{A}} \varphi$$

Example 7. We continue with Ex. 5. Consider a knowledge base K expressing that the role P is asymmetric. The knowledge base becomes inconsistent if the first two or the last two axioms from µ are added to it. Thus, the generated well-formedness predicate wf_µ is the following

$$wf_{\mu}(tr) \equiv \neg\Big(\big((\exists i.\ \mathsf{rule}(tr[i]) \doteq \mathsf{r}_1) \land (\exists i.\ \mathsf{rule}(tr[i]) \doteq \mathsf{r}_2)\big) \lor \big((\exists i.\ \mathsf{rule}(tr[i]) \doteq \mathsf{r}_3) \land (\exists i.\ \mathsf{rule}(tr[i]) \doteq \mathsf{r}_4)\big)\Big)$$

The above procedure has exponential complexity in the number of branches of the mapping. But as every superset of an inconsistent set is also inconsistent, it is not necessary to generate all subsets: it suffices to consider the following set of minimal inconsistencies instead, which can be computed by testing for inconsistency along the subsets ordered by ⊂.

$$\mathit{min\text{-}Incons} = \{A \mid A \in \mathit{Incons} \land \forall A' \in \mathit{Incons}.\ A' \neq A \to A' \not\subset A\}$$
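The min-Incons computation can be sketched by enumerating candidate subsets in increasing size and pruning supersets of already-found inconsistencies. The encoding is ours; `is_consistent_with_K` is a hypothetical stand-in for the DL consistency check.

```python
# Hedged sketch of the min-Incons computation (our encoding): enumerate
# subsets of bodies by size, skipping any superset of a known minimal
# inconsistent set, since supersets of inconsistent sets stay inconsistent.
from itertools import combinations

def minimal_inconsistent(bodies, is_consistent_with_K):
    """bodies: list of axiom sets; returns minimal inconsistent index sets."""
    minimal = []
    for size in range(1, len(bodies) + 1):
        for subset in combinations(range(len(bodies)), size):
            s = set(subset)
            if any(m <= s for m in minimal):
                continue  # superset of a known inconsistency: prune
            union = set().union(*(bodies[i] for i in subset))
            if not is_consistent_with_K(union):
                minimal.append(s)
    return minimal
```

On the four bodies of Ex. 5 with an asymmetric role P, this yields exactly the two minimal sets {first, second} and {third, fourth}.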

If well-formedness is defined inductively, then we can give an even more specific transformation. A well-formedness predicate is inductive if checking well-formedness of each trace together with its last event is equivalent to evaluating a formula over the trace that is specific to the event. If this is the case, then each rule, which dictates the event, can have its own, highly specialized well-formedness guard, which further enhances efficiency.

Definition 10 (Inductive Well-Formedness). A well-formedness predicate wf is inductive <sup>4</sup> for some set of rules R if there is a set of predicates wf <sup>r</sup> for all rules r ∈ R, such that wf can be written as an inductive definition:

$$wf(\langle \rangle) \equiv \text{true}$$

$$wf(tr \circ ev) \equiv wf(tr) \land \bigwedge\_{r \in \mathcal{R}} \left( (\text{rule}(ev) \doteq r) \to wf\_r(tr, ev) \right)$$

in which wf_r(tr, ev) is the local well-formedness predicate specifically for rule r with the condition rule(ev) ≐ r. The predicate wf_r forms the guard for rule r. Every well-formedness predicate is equivalent to an inductive well-formedness predicate by setting wf_r(tr, ev) = wf(tr ◦ ev), but we aim to give more specific predicates per rule.

Example 8. Finishing our geological system, we can give local well-formedness predicates for all rules. For example, we can define a specific guard for rule depositStego expressing that the deposition of a layer containing Stegosaurus fossils is not allowed if a deposition of a layer containing Tyrannosaurus fossils is already captured in the trace tr. Compared with the approach where the whole knowledge base needs to be checked, the proposed solution using

<sup>4</sup> Our well-formedness predicates are inspired by the ones used in verification of concurrent systems, where they characterize traces w.r.t. a specific concurrency model [21].

inductive well-formedness reduces the complexity of the analysis significantly. For instance, the rule for deposition does not need to concern itself with the ordering of geological ages.

$$\begin{split}
wf_{\mathsf{deposit}}(tr, \mathsf{ev}(\mathsf{deposit}, v:l)) \equiv wf_{\mathsf{erode}}(tr, \mathsf{ev}(\mathsf{erode}, v:l)) &\equiv \mathtt{true}\\
wf_{\mathsf{depositTRex}}(tr, \mathsf{ev}(\mathsf{depositTRex}, v:l)) &\equiv \mathtt{true}\\
wf_{\mathsf{depositStego}}(tr, \mathsf{ev}(\mathsf{depositStego}, v:l)) &\equiv \forall i \leq |tr|.\ \mathsf{rule}(tr[i]) \neq \mathsf{depositTRex}
\end{split}$$
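The local guards of Ex. 8 can be sketched as a per-rule lookup. The encoding is ours: the trace is a list of past rule names, and each guard checks only its own first-order condition, with no KB reasoning at run time.

```python
# Hedged sketch of Ex. 8's local well-formedness guards (our encoding): three
# rules are unconditionally allowed; depositStego requires that no
# depositTRex event occurs anywhere in the past trace.
def wf_depositStego(trace, event):
    # forall i <= |tr|. rule(tr[i]) != depositTRex
    return all(rule != "depositTRex" for rule in trace)

LOCAL_WF = {
    "deposit": lambda tr, ev: True,
    "erode": lambda tr, ev: True,
    "depositTRex": lambda tr, ev: True,
    "depositStego": wf_depositStego,
}

def step_allowed(trace, rule):
    """Evaluate the guard wf_r of rule r on the trace of past rule names."""
    return LOCAL_WF[rule](trace, rule)
```

Note that depositing a Tyrannosaurus layer after a Stegosaurus layer remains allowed; only the reverse order is blocked, matching the trace-based argument of Sec. 3.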

Definition 11 (Transition System using Well-Formedness). Let wf be an inductive well-formedness predicate for a set of rules R, some mapping µ, some knowledge base K. We define the transformed guarded transition system with the following rule for each r ∈ R.

$$(\mathsf{wf\text{-}r})\ \frac{t \xrightarrow{\mathsf{r},\sigma} t' \qquad ev = \mathsf{ev}(\mathsf{r},\sigma) \qquad wf_{\mathsf{r}}(tr, ev)}{(t, tr) \xrightarrow{\mathsf{r}} (t', tr \circ ev)}$$

The set of traces generated by this transition system from a starting term t_0 is denoted G(R, wf, t_0). Execution always starts with the empty trace.

Note that (a) we use a specific well-formedness predicate per rule, and (b) unlike the rules in Def. 8 and Def. 9, we do not extend the trace tr in the premise.

Theorem 2. Let wf be an inductive well-formedness predicate for a set of rules R, some mapping µ and some knowledge base K. The guided systems of Def. 8 and Def. 11 generate the same traces: $\forall t.\ H(\mathcal{R}, \mu, \mathcal{K}, t) = G(\mathcal{R}, \mathit{wf}, t)$

We can also define determinism in terms of inductive well-formedness. An inductive well-formedness predicate wf is deterministic if, for each trace tr and event ev, only one local well-formedness predicate wf_r(tr, ev) holds.

Proposition 1 (Deterministic Well-Formedness). An inductive well-formedness predicate wf with local well-formedness predicates {wf_r}_{r∈R} is deterministic, if

$$\forall tr.\ \forall ev.\ \bigwedge\_{r\in\mathcal{R}}\left(wf\_r(tr,ev)\to\bigwedge\_{\substack{r'\in\mathcal{R}\\r'\neq r}}\neg wf\_{r'}(tr,ev)\right).$$

For deterministic predicates, only one trace is generated: $|G(\mathcal{R}, \mathit{wf}, t)| = 1$.

When the programmer designs the mapping, the focus is on mapping enough information to achieve inconsistency, to ensure that certain transition steps are not performed. If the same mapping is to be used to retrieve results from the computation, e.g., to query over the final trace, this may be insufficient. Next, we discuss mappings that preserve more, or all information from the trace.

# 6 (Semi-)Automatically Generated Mappings

The mappings we discussed so far must be defined completely by the programmer and are used to extract certain information from a trace, sufficient to enforce domain invariants at runtime. In this section, we introduce mappings which can be constructed (semi-)automatically to simplify the usage of domain invariants: transducing mappings and direct mappings leverage the structure of the trace directly. A transducing mapping is constructed semi-automatically: it applies some manually defined mapping to each event and automatically connects every pair of consecutive events in a trace using the next role in the KB. A direct mapping relates each event with its parameters and is constructed fully automatically. Both kinds of mappings are not only easier to use for the programmer, they can also be used by the domain users to access the results of the computation in terms of the domain.

A transducing mapping is semi-automatic in the sense that part of the mapping is pre-defined, and the programmer must only define the remaining part, namely the mapping from a single event to a KB.

Formally, a transducing mapping consists of a function ι that generates unique individual names<sup>5</sup> per event and a user-defined function ε that maps every event to a KB.

Definition 12 (Transducing Mapping). Let ι be an injective function from natural numbers to individuals, and ε be a function from events to KBs. Let next be an asymmetric role. Given a trace tr, the transducing mapping $\delta^{\mathtt{next}}_{\iota,\epsilon}(tr)$ is defined as follows. For simplicity, we annotate the index i of an event in tr directly.

$$\delta^{\mathtt{next}}\_{\iota,\epsilon}(\langle\rangle) = \emptyset \qquad \delta^{\mathtt{next}}\_{\iota,\epsilon}(\langle\mathtt{ev}\_{i}\rangle) = \epsilon(\mathtt{ev}\_{i})$$

$$\delta^{\mathtt{next}}\_{\iota,\epsilon}(\langle\mathtt{ev}\_{i},\mathtt{ev}\_{j}\rangle \circ tr) = \epsilon(\mathtt{ev}\_{i}) \cup \{\mathtt{next}(\iota(i),\iota(j))\} \cup \delta^{\mathtt{next}}\_{\iota,\epsilon}(\langle\mathtt{ev}\_{j}\rangle \circ tr)$$

in which the ◦ operator concatenates two traces. This approach is less demanding than designing an arbitrary mapping, as the structure of the sequence between each pair of consecutive events is taken care of by the next role, and ι is trivial in most cases: one can just generate a fresh node with the number as part of its individual symbol. The programmer only has to provide a function ε for events.

Example 9. Our geology example can be reformulated with the following user-defined function ε_geo. Let ι_geo map every natural number i to the symbol layer_i:

$$\begin{aligned} \epsilon\_{geo}(\mathsf{ev}\_i(\mathsf{depositStego}, v:l)) &= \{\mathsf{contains}(\iota\_{geo}(i), \mathsf{Stegosaurus}), \mathsf{Layer}(\iota\_{geo}(i))\} \\ \epsilon\_{geo}(\mathsf{ev}\_i(\mathsf{depositTRex}, v:l)) &= \{\mathsf{contains}(\iota\_{geo}(i), \mathsf{Tyrannosaurus}), \mathsf{Layer}(\iota\_{geo}(i))\} \\ \epsilon\_{geo}(\mathsf{ev}\_i(\mathsf{deposit}, v:l)) &= \{\mathsf{contains}(\iota\_{geo}(i), \mathsf{Nothing}), \mathsf{Layer}(\iota\_{geo}(i))\} \\ \epsilon\_{geo}(\mathsf{ev}\_i(\mathsf{erode}, v:l)) &= \emptyset \end{aligned}$$

Note that the function ιgeo(i) is used to generate new symbols for each event, which are then declared to be geological layers by the axiom Layer(ιgeo(i)). It

<sup>5</sup> If using the Resource Description Framework (RDF) [43] for the knowledge base, one requires fresh unique resource identifiers (URI).

generalizes the set of fresh names from first-order matching mappings in Def. 7. Based on this function definition, the example in Sec. 3 can be performed using the transducing mapping δ<sup>below</sup> instantiated with ι<sub>geo</sub> and ε<sub>geo</sub>. The connections between each pair of consecutive events in a trace, i.e., that one layer is below another, are derived from the axioms in the domain knowledge and added as additional axioms to the KB.
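A hypothetical rendering of ε<sub>geo</sub> from Example 9 in the same sketch style; the rule names follow the running example, while the tuple encoding of events is our own assumption.

```python
# Sketch of eps_geo: each deposit event yields a contains axiom and a
# Layer axiom for the fresh individual layer_i; erode events yield no
# axioms. Events are modelled as (ruleName, bindings) pairs.

def iota_geo(i):
    return f"layer{i}"

CONTENTS = {"depositStego": "Stegosaurus",
            "depositTRex": "Tyrannosaurus",
            "deposit": "Nothing"}

def eps_geo(i, event):
    rule, _bindings = event
    if rule in CONTENTS:
        return {f"contains({iota_geo(i)},{CONTENTS[rule]})",
                f"Layer({iota_geo(i)})"}
    return set()  # erode contributes no axioms
```

Plugging `eps_geo` into a transducing mapping yields the layer axioms of Sec. 3 without the programmer having to handle the trace structure.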

So far, the mappings from a trace to information in terms of a specific domain are defined by the programmer. To further automate the construction of mappings, we give a direct mapping, which captures all information of a trace in a KB. More technically, the direct mapping directly expresses the trace structure using a special vocabulary, which captures domain knowledge about traces themselves and is independent of any application domain. We first define the domain knowledge about the trace structure.

Definition 13 (Knowledge Base for Traces). The knowledge base for traces contains the concept Event modeling events, the concept Match modeling a pair of a variable and its matching term, and the concept Term for terms. Furthermore, the functional property appliesRule connects events to rule names (as strings), the property match connects the individuals for events with the individuals for matches (i.e., an event with the pairs v : t of a variable and the term assigned to this variable), the property var connects matches and variables (as strings), and the property term connects matches and terms.

Recall that KBs only support binary predicates, so we cannot avoid formalizing the concept of a match, which connects three parts: event, variable and term. The direct mapping lessens the workload for the programmer further: it requires no additional input and can be constructed fully automatically. It is a pre-defined mapping for all programs, defined by instantiating a transducing mapping with the next role and pre-defined functions ε<sub>direct</sub> and ι<sub>direct</sub> for ε and ι. Additionally, we must generate fresh individuals for the matches. The formal definition of the pre-defined functions for the direct mapping is as follows.

Definition 14 (Direct Mapping). The direct mapping is defined as the transducing mapping δ<sup>next</sup> instantiated with the pre-defined functions ι<sub>direct</sub> and ε<sub>direct</sub>, where ι<sub>direct</sub> maps every natural number i to an individual e<sub>i</sub>. The individuals match<sub>i</sub><sup>j</sup> uniquely identify a match inside a trace for the jth variable of the ith event, and we regard variables as strings containing their names. The function ε<sub>direct</sub> is defined as follows:

$$\epsilon\_{direct}(\mathsf{ev}\_i(r, v\_1 : t\_1, \ldots, v\_n : t\_n)) = \{\mathsf{Event}(\iota\_{direct}(i)), \mathsf{appliesRule}(\iota\_{direct}(i), r)\} \cup \bigcup\_{j \leq n} \{\mathsf{match}(\iota\_{direct}(i), \mathsf{match}\_i^j), \mathsf{var}(\mathsf{match}\_i^j, v\_j), \mathsf{term}(\mathsf{match}\_i^j, \eta(t\_j))\} \cup \delta(t\_j)$$

where δ(t<sub>j</sub>) deterministically generates the axioms for the tree structure of the term t<sub>j</sub> according to Def. 3, and η(t<sub>j</sub>) returns the individual of the head of t<sub>j</sub>.
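The function ε<sub>direct</sub> can be sketched as follows. This is our simplified encoding: terms are treated as atomic names, so the δ(t<sub>j</sub>)/η(t<sub>j</sub>) handling of term trees is omitted, and axiom strings are our own representation.

```python
# Sketch of eps_direct (Def. 14) with atomic terms: one Event axiom,
# one appliesRule axiom, and match/var/term axioms per variable.

def eps_direct(i, rule, bindings):
    axioms = {f"Event(e{i})", f'appliesRule(e{i},"{rule}")'}
    for j, (var, term) in enumerate(bindings, start=1):
        m = f"match{i}_{j}"  # fresh individual per (event, variable)
        axioms |= {f"match(e{i},{m})",
                   f'var({m},"{var}")',
                   f"term({m},{term})"}
    return axioms
```

Applied to an event such as ev<sub>1</sub>(depositStego, v : layer0), the sketch produces the Event, appliesRule, match, var and term axioms described in the text.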

The properties match, var and term connect each event with its parameters. For example, the match v : layer0 of the first event in Ex. 4 generates

```
match(e1, match0_1), var(match0_1, "v"), term(match0_1, layer0)
```
where e1 is the representation of the event and match0_1 is the representation of the match in the KB. The complete direct mapping is given in the following example.

Example 10. The direct mapping of Ex. 4 is as follows. We apply the direct function to all three events, where each event has one parameter.

```
{ Event(e1), Event(e2), Event(e3), next(e1, e2), next(e2, e3), appliesRule(e1, "depositStego"),
  appliesRule(e2, "erode"), appliesRule(e3, "depositTRex"), match(e1, m1), var(m1, "v"),
  term(m1, layer0), match(e2, m2), var(m2, "v"), term(m2, layer0), match(e3, m3),
  var(m3, "v"), term(m3, layer1) }
```
# 7 Discussion

Querying and Stability. The mapping can be used by domain users to interact with the system. For one, it can be used to retrieve the result of the computation using the vocabulary of a domain. For example, the following SPARQL [44] query retrieves all layers deposited during the Jurassic:

```
SELECT ?l WHERE {?l a Layer. ?l during Jurassic}
```
Indeed, one of the main advantages of knowledge bases is that they enable ontology-based data access [46]: uniform data access in terms of a given domain. Another possibility is to use justifications [5]. Justifications are minimal sets of axioms responsible for entailments over a knowledge base, e.g., for finding out why it is inconsistent. They can explain, during an interaction, why certain steps are not possible.
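To make such a query concrete without a SPARQL engine, here is a toy evaluation over an in-memory triple set; the data and the helper function are invented for illustration and stand in for a real ontology-based data access layer.

```python
# Toy stand-in for the SPARQL query in the text: select all ?l with
# ?l a Layer and ?l during Jurassic, over a hand-written triple set.

triples = {
    ("layer0", "a", "Layer"), ("layer0", "during", "Jurassic"),
    ("layer1", "a", "Layer"), ("layer1", "during", "Cretaceous"),
}

def select_layers_during(period, kb):
    # Match the two triple patterns of the query and join on ?l.
    return sorted(s for (s, p, o) in kb
                  if p == "a" and o == "Layer"
                  and (s, "during", period) in kb)

# select_layers_during("Jurassic", triples) -> ["layer0"]
```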

Programmers do not need to design a complete knowledge base – for many domains, knowledge bases are already available, for example in the form of industrial standards [26,23]. For more specific knowledge bases, clear design principles based on experience in ontology engineering are available [17]. Note that these KBs are stable and rarely change. Our system requires static domain knowledge, as changes in the DKDP can invalidate traces during execution without a rule being executed; this is, thus, not a limitation if one uses stable ontologies.

The direct mapping uses a fixed vocabulary, but one can formulate the connection to the domain knowledge by using additional axioms. In Ex. 10, one can declare every event to be a layer. The axiom for depositStego is as follows.

### appliesRule value "depositStego" SubClassOf contains value Stegosaurus

The exact mapping strategy is application-specific – for example, to remove information, erode must be handled through additional axioms as well, for example by adding a special concept RemovedLayer that is defined as all layers that

Fig. 3. Runtime comparison.

are matched on by some erode event. We next discuss some of the considerations when choosing the style of mapping, and the limitations of each.

There are, thus, two styles to connect trace and domain knowledge: One can add axioms connecting the vocabulary of traces with the vocabulary of the DKDP (direct mapping), or one can translate the trace into the vocabulary of the DKDP (first-order matching mapping, transducing mappings).

The two styles require different skills from the programmer to interact with the domain knowledge: the first style requires expressing a trace as part of the domain as a set of ABox axioms, while the second one requires connecting general traces to the domain using TBox axioms. Thus, the second style operates on a higher level of abstraction, and we conjecture that such mappings may require more interaction with the domain expert and deeper knowledge about knowledge graphs. However, the same insights needed to define the TBox axioms are also needed to define the guards of a first-order matching mapping.

Naming Schemes. Transducing mappings and first-order matching mappings have different naming schemes. A transducing mapping, and thus a direct mapping, generates a new name per event, while a first-order matching mapping generates a fixed number of new names per rule. A transducing mapping can extract quite extensive knowledge from a trace, with the direct mapping giving a complete representation of it in a KB. As discussed, this requires the user to define general axioms. A first-order matching mapping must work with fewer names and extracts less knowledge from a trace. Its design requires choosing the right amount of abstraction to detect inconsistencies.

Evaluation. To evaluate whether the proposed system indeed gives a performance increase, we have implemented the running example<sup>6</sup> as follows: the system generates all traces up to length n, using three different transition systems: (a) the guided system (Def. 8) using the transducing mapping of Ex. 9, for which we use the Apache Jena framework [2] for reasoning; (b) the guarded system (Def. 11) that uses a native implementation of the well-formedness predicate; and (c) the guarded system that uses the Z3 SMT solver [18] to check the first-order logic guards. The results are shown in Fig. 3. As we can see, the native implementation of the guarded system is near-instant for n ≤ 7, while the guided

<sup>6</sup> https://github.com/Edkamb/KnowEnforce. We slightly modified the example and replaced the asymmetry axioms with an equivalent formalization to fit the example into the fragment supported by the Jena OWL reasoner.

system needs more than 409s for n = 7 and shows the expected blow-up due to the N2ExpTime-completeness of reasoning in the logic underlying OWL [30]. The guarded system based on SMT similarly shows non-linear behavior, but scales better than the guided system. For the evaluation, we ran each system three times for every n and averaged the numbers, using an Ubuntu 21.04 machine with an i7-8565U CPU and 32GB RAM. As we can see, the guarded system allows for an implementation that does not rely on an external, general-purpose reasoner to evaluate the guards and increases the scalability of the system, while the guided system does not scale even for small systems and KBs.

### 8 Related Work

Runtime enforcement is a vast research field; for a recent overview, we refer to the work of Falcone and Pinisetty [22]. In the following, we discuss related work on combinations of ontologies/knowledge bases and transition systems.

Concerning the combination of ontologies/knowledge bases and business process modeling, Corea et al. [16] point out that current approaches lack the foundation to annotate and develop ontologies together with business process rules. Our approach focuses explicitly on automating the mapping, or supporting developers in its development in a specific context, thus satisfying requirements 1 and 7 in their gap analysis for ontology-based business process modelling. Note that most work in this domain uses ontologies for the process model itself, similar to the ontology we give in Def. 13 (e.g., Rietzke et al. [36]), or the current state (e.g., Corea and Delfmann [15]), not the trace. We refer to the survey of Corea et al. for a detailed overview.

Compared with existing simulators of hydrocarbon exploration [20,47], which formalize the domain knowledge of geological processes directly in the transition rules, we propose a general framework to formalize the domain knowledge in a knowledge base that is independent of the term rewriting system. This clear separation of concerns makes it easier for domain users without programming skills to use the knowledge base for simulation.

Tight interactions between programming languages, or transition systems, and knowledge bases beyond logic programming have recently received increasing research attention. The focus of the work of Leinberger [29,32] is the type safety of loading RDF data from knowledge bases into programming languages. Kamburjan et al. [28] semantically lift states for operations on the KB representation of the state, but are not able to access the trace. In logic programming, a concurrent extension of Golog [33] is extended by Zarrieß and Claßen [48] to verify CTL properties with description logic assertions.

Cauli et al. [12] use knowledge bases to reason about the security properties of deployment configurations in the cloud, a high-level representation of the overall system. As for traces, Pattipati et al. [34] introduce a debugger for C programs that operates on logs, i.e., special traces. Their system operates post-execution and cannot guide the system. Al Haider et al. [1] use a similar technique to investigate logged traces of a program.

In runtime verification, knowledge bases have been investigated by Baader and Lippmann [6] in ALC-LTL, which uses the description logic ALC instead of propositional variables inside LTL. An overview of further temporalizations of description logics can be found in the work of Baader et al. [4]. Runtime enforcement has been applied to temporal properties over traces since its beginnings [37] but, as a recent survey by Falcone and Pinisetty [22] points out, mainly for security/safety or usage control of libraries. In contrast, our work requires the enforcement to do any meaningful computation and uses a different way to express constraints than prior work: consistency with knowledge bases.

DatalogMTL extends Datalog with MTL operators [9,45] to enable ontology-based data access over sequences using inference rules. The ontology is expressed in these rules, i.e., it is not declarative but an additional programming layer, which we deem impractical for domain users from non-computing domains. DatalogMTL has been used for queries [10] but not for runtime enforcement.

Traces have been explored from a logical perspective mainly in the style of CTL<sup>∗</sup>, TLA and similar temporal logics. More recently, interest in more expressive temporal properties over traces of programming languages for verification has risen and led to symbolic traces [11,19], integrations of LTL and dynamic logics for Java-like languages [8], and trace languages based on type systems [27]. These approaches have in common that they aim for more expressive power and are geared towards better usability for programmers and simple verification calculi. They are only used for verification, not at runtime, and do not connect to formalized domain knowledge.

The guided system can be seen as a meta-computation, as put forward by Clavel et al. [14] for rewriting logic; they do not discuss the use of consistency as a meta-computation and instead program such meta-computations explicitly.

### 9 Conclusion

We present a framework that uses domain knowledge about dynamic processes to guide the execution of generic transition systems through runtime enforcement. We give a transformation to use rule-specific guards instead of using the domain knowledge directly as a consistency invariant over knowledge bases. The transformation is transparent: the domain user can interact with the system without being aware of the transformation or implementation details. To reduce the workload on the programmer, we discuss semi-automatic design of mappings using transducing approaches and a pre-defined direct mapping. We also discuss further alternatives, such as additional axioms on the events, and the use of local well-formedness predicates for certain classes of mappings.

Future Work. We plan to investigate how our system can interact with knowledge base evolution [24], a more declarative approach for changes in knowledge bases, as well as other approaches to modeling sequences in knowledge bases [40].

Acknowledgements This work was supported by the University of Bergen and the Research Council of Norway via SIRIUS (237898) and PeTWIN (294600).

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Specification and Validation of Normative Rules for Autonomous Agents

Sinem Getir Yaman() , Charlie Burholt, Maddie Jones, Radu Calinescu, and Ana Cavalcanti

Department of Computer Science, University of York, York, UK sinem.getir.yaman@york.ac.uk

Abstract. A growing range of applications use autonomous agents such as AI and robotic systems to perform tasks deemed dangerous, tedious or costly for humans. To truly succeed with these tasks, the autonomous agents must perform them without violating the social, legal, ethical, empathetic, and cultural (SLEEC) norms of their users and operators. We introduce SLEECVAL, a tool for specification and validation of rules that reflect these SLEEC norms. Our tool supports the specification of SLEEC rules in a DSL [1] we co-defined with the help of ethicists, lawyers and stakeholders from health and social care, and uses the CSP refinement checker FDR4 to identify redundant and conflicting rules in a SLEEC specification. We illustrate the use of SLEECVAL for two case studies: an assistive dressing robot, and a firefighting drone.

### 1 Introduction

AI and autonomous robots are being adopted in applications from health and social care, transportation, and infrastructure maintenance. In these applications, the autonomous agents are often required to perform normative tasks that raise social, legal, ethical, empathetic, and cultural (SLEEC) concerns [2]. There is widespread agreement that these concerns must be considered throughout the development of the agents [3,4], and numerous guidelines propose high-level principles that reflect them [5,6,7,8]. However, to follow these guidelines, the engineers developing the control software of autonomous agents need methods and tools that support the formalisation, validation and verification of SLEEC requirements.

The SLEECVAL tool introduced in our paper addresses this need by enabling the specification and validation of SLEEC rules, i.e., nonfunctional requirements focusing on SLEEC principles. To the best of our knowledge, our tool is novel in its support for the formalisation and validation of normative rules for autonomous agents, and represents a key step towards an automated framework for specifying, validating and verifying autonomous agent compliance with such rules.

SLEECVAL is implemented as an Eclipse extension, and supports the definition of SLEEC rules in a domain-specific language (DSL). Given a set of such rules, the tool extracts their semantics in tock-CSP [9], a discrete-time variant of the CSP process algebra [10], and uses the CSP refinement checker FDR4 [11] to detect conflicting and redundant rules, providing counterexamples when such

Fig. 1: Fragment of the SLEEC specification for an assistive dressing robot.

problems are identified. Our SLEECVAL tool and case studies, together with a description of its DSL syntax (BNF grammar) and tock-CSP semantics, are publicly available on our project webpage [12] and GitHub repository [13].

### 2 SLEECVAL: Notation, Components, and Architecture

SLEEC Rule Specification. As illustrated in Fig. 1, SLEEC DSL provides constructs for organising a SLEEC specification into a definition and a rule block. The definition block includes the declarations of events such as UserFallen, which corresponds to the detection of a user having fallen, and measures such as userDistressed, which becomes true when the user is distressed. Events and measures reflect the capabilities of the agent in perceiving and affecting its environment.

A SLEEC rule has the basic form 'when trigger then response'. The trigger defines an event whose occurrence indicates the need to satisfy the constraints defined in the response. For example, Rule1 applies when the event DressingStarted occurs. In addition, the trigger may include a Boolean expression over measures from the definition block. For instance, Rule3 applies when the event OpenCurtainsRequested occurs and, additionally, the Boolean measure userUndressed is true. The response defines requirements that need to be satisfied when the trigger holds, and may include deadlines and timeouts.

```
//Conflicting Rules
RuleA when OpenCurtainsRequested then CurtainsOpened within 3 seconds
RuleB when OpenCurtainsRequested and userUndressed then not CurtainsOpened
//Redundant Rules
RuleC when DressingStarted then DressingFinished
RuleD when DressingStarted then DressingFinished within 2 minutes
```

(a) Example of conflicting and redundant rules written in SLEECVAL.

```
1 // CONFLICT CHECKING
2 SLEECRuleARuleB = timed priority(intersectionRuleARuleB)
3 assert SLEECRuleARuleB:[deadlock-free]
4 // REDUNDANCY CHECKING
5 SLEECRuleCRuleD = timed priority(intersectionRuleCRuleD)
6 assert not MSN::C3(SLEECRuleCRuleD) [T= MSN::C3(SLEECRuleD)
```

(b) Conflict and redundancy handling in CSP using FDR4.

Fig. 2: SLEECVAL conflict and redundancy checking.

The within construct specifies a deadline for the occurrence of a response. To accommodate situations where a response may not happen within its required time, the otherwise construct can be used to specify an alternative response. In Rule6, the response requires the occurrence of the event HealthChecked within 30 seconds, but provides an alternative of SupportCalled if there is a timeout.

Importantly, a rule can be followed by one or more defeaters [14], introduced by the unless construct, and specifying circumstances that preempt the original response and provide an alternative. In Rule8, the first unless preempts the response if userUnderdressed is true, and a second defeater preempts both the response and the first defeater if the value of the measure userDistressed is 'high'.
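The defeater chain can be read operationally as "the last defeater whose condition holds wins". The toy sketch below illustrates this reading; the measure names mirror the description of Rule8, while the response names and the dictionary encoding of measures are placeholders we invented.

```python
# Toy evaluation of a SLEEC rule with `unless` defeaters: start from
# the default response, and let each defeater whose condition holds
# preempt the response and any earlier defeater.

def applicable_response(default, defeaters, measures):
    response = default
    for condition, alternative in defeaters:
        if condition(measures):
            response = alternative
    return response

# Hypothetical encoding of a rule shaped like Rule8.
rule8_defeaters = [
    (lambda m: m["userUnderdressed"], "altResponse1"),
    (lambda m: m["userDistressed"] == "high", "altResponse2"),
]
```

With userDistressed at 'high', the second defeater preempts both the default response and the first defeater, matching the behaviour described in the text.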

SLEEC Rule Validation. SLEECVAL supports rule validation via conflict and redundancy checks. To illustrate the process, we consider the conflicting RuleA and RuleB from Fig. 2a, for the dressing robot presented above. Each rule is mapped to a tock-CSP process automatically generated by SLEECVAL. To define the checks, SLEECVAL computes the alphabet of each rule, i.e., the set of events and measures that the rule references, and examines each pair of rules.

For rule pairs with disjoint alphabets, there is no need to check consistency or redundancy. Otherwise (i.e., for rule pairs with overlapping alphabets), refinement assertions are generated as illustrated in Fig. 2b. Line 1 defines a tock-CSP process SLEECRuleARuleB that captures the intersection of the behaviours of the rules (in the example, RuleA and RuleB). The assertion in Line 3 is a deadlock check to reveal conflicts. If the assertion fails, there is a conflict between the two rules, and FDR4 provides a counterexample. For instance, the trace below is a counterexample that illustrates a conflict between RuleA and RuleB.

Fig. 3: SLEECVAL workflow.

#### OpenCurtainsRequested → userUndressed.true → tock → tock → tock

This trace shows a deadlock in a scenario in which OpenCurtainsRequested occurs, and the user is undressed, as indicated by the CSP event userUndressed.true. In these circumstances, RuleA imposes a deadline of 3 s for CurtainsOpened to occur, but RuleB forbids it. With a tock event representing 1 s, after three tock events, no further events can occur: tock cannot occur because the maximum 3 s allowed by RuleA have passed, and CurtainsOpened is disallowed by RuleB.
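The deadlock can be reproduced with a small hand-rolled simulation; this is our simplification of the tock-CSP semantics, not FDR4 output, and the function name is ours.

```python
# After OpenCurtainsRequested with the user undressed: RuleA lets time
# pass only while its 3-tock deadline has not expired, and RuleB
# forbids CurtainsOpened. At tock 3, no event is enabled: deadlock.

def enabled_events(tocks_elapsed, user_undressed, deadline=3):
    events = set()
    if tocks_elapsed < deadline:
        events.add("tock")            # RuleA: deadline not yet reached
    if not user_undressed:
        events.add("CurtainsOpened")  # RuleB forbids it when undressed
    return events

# enabled_events(3, True) -> set(), i.e., deadlock after three tocks
```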

To illustrate our redundancy check, we consider RuleC and RuleD in Fig. 2a. Line 5 in Fig. 2b defines the CSP process that captures the conjunction of these rules. Line 6 shows the assertion for checking whether RuleC is redundant given RuleD. It checks whether the behaviours allowed by RuleD are those allowed (according to trace refinement '[T=') by the conjunction of RuleC and RuleD. If they are, RuleC imposes no extra restrictions and so is redundant. The assertion states that RuleC is not redundant; FDR4 shows that the assertion fails, as expected, since RuleD is more restrictive in its deadline. No counterexample is provided because the refinement holds.

The complexity of this process of validation is quadratic in the number of rules since the rules are considered pairwise. We refer the reader to [9] for background on refinement checking in tock-CSP using FDR4.
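The pair-selection step described above can be sketched as follows; the rule names echo Fig. 2a, but the alphabets and the helper function are invented for illustration.

```python
# Sketch of SLEECVAL-style pair selection: only rule pairs with
# overlapping alphabets need a refinement check, which bounds the
# number of checks quadratically in the number of rules.

from itertools import combinations

def pairs_to_check(alphabets):
    return [(a, b) for a, b in combinations(sorted(alphabets), 2)
            if alphabets[a] & alphabets[b]]

rules = {
    "RuleA": {"OpenCurtainsRequested", "CurtainsOpened"},
    "RuleB": {"OpenCurtainsRequested", "CurtainsOpened", "userUndressed"},
    "RuleC": {"DressingStarted", "DressingFinished"},
}
# pairs_to_check(rules) -> [("RuleA", "RuleB")]
```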

Specification and Validation Workflow. The SLEECVAL workflow relies on the three components shown in Fig. 3. We implemented the parser for the SLEEC DSL in Eclipse Xtext [15] using EBNF. The SLEEC concrete syntax provided by SLEECVAL supports highlighting of the keyword elements, and there is extra support in the form of pop-up warnings and errors. SLEECVAL also enforces a simple style for naming rules, events, and measures. Conflicts are treated as errors, whereas redundant rules are indicated as warnings.

The tock-CSP processes that define the semantics of the rules are computed through a visitor pattern applied to each element of the SLEEC grammar's syntax tree, with each SLEEC rule converted to a tock-CSP process. The computation is based on translation rules. Each event and measure is modelled in tock-CSP as a channel, with measure types directly converted into existing CSP datatypes, or introduced as a new scalar datatype in CSP.


Table 1: Summary of evaluation results.

### 3 Evaluation

Case studies. We used SLEECVAL to specify and validate SLEEC rule sets for agents in two case studies, presented next and summarised in Table 1.

Case study 1. The autonomous agent from the first case study is an assistive dressing robot from the social care domain [16]. The robot needs to dress a user with physical impairments with a garment, performing an interactive process that involves finding the garment, picking it up, and placing it over the user's arms and torso. The SLEEC specification for this agent comprises nine rules, a subset of which is shown in Fig. 1. SLEECVAL identified four pairs of conflicting rules and two pairs of redundant rules in the initial version of this SLEEC specification, including the conflicting rules RuleA and RuleB, and the redundant rules RuleC and RuleD from Fig. 2a.

Case study 2. The autonomous agent from the second case study is a firefighter drone whose detailed description is available at [17]. Its model identifies 21 robotic-platform services (i.e., capabilities) corresponding to sensors, actuators, and an embedded software library of the platform. We consider scenarios in which the firefighter drone interacts with several stakeholders: human firefighters, humans affected by a fire, and teleoperators.

In these scenarios, the drone surveys a building where a fire was reported to identify the fire location, and it either tries to extinguish a clearly identified fire using its small on-board water reservoir, or sends footage of the surveyed building to teleoperators. If, however, there are humans in the video stream, there are privacy (ethical and/or legal) concerns. Additionally, the drone sounds an alarm when its battery is running out. There are social requirements about sounding a loud alarm too close to a human. The SLEEC specification for this agent consists of seven rules, within which SLEECVAL identified one conflict (between the rules shown in Fig. 4) and seven redundancies. The conflict is due to the fact that Rule3 requires the alarm to be triggered (event SoundAlarm) when the battery level is critical (signalled by the event BatteryCritical) and either the temperature is greater than 35◦C or a person is detected, while the defeater from Rule7 prohibits the triggering of the alarm when a person is detected.

Overheads. The overheads of the SLEECVAL validation depend on the complexity and size of the SLEEC specifications, which preliminary discussions with stakeholders suggested might include between several tens and a few hundred rules. In our evaluation, the checks of the 27 assertions from the assistive robot

```
Rule3 when BatteryCritical and temperature > 35 or personDetected then SoundAlarm
Rule7 when BatteryCritical then SoundAlarm
    unless personDetected then goHome
    unless temperature > 35
```

Fig. 4: Conflicting rules for the firefighter drone case study.

case study and of the 63 assertions from the firefighter drone case study were performed in under 30s and 70s, respectively, on a standard MacBook laptop. As the number of checks is quadratic in the size of the SLEEC rule set, the time required to validate a fully fledged rule set of, say, 100–200 rules should not exceed tens of minutes on a similar machine.
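As a rough sanity check of this estimate, the upper bound on pairwise checks can be computed directly; this is our arithmetic, assuming every rule pair is checked.

```python
# Upper bound on pairwise validation checks for n rules: n*(n-1)/2.

def max_pairwise_checks(n):
    return n * (n - 1) // 2

# max_pairwise_checks(100) -> 4950
# max_pairwise_checks(200) -> 19900
```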

Usability. We conducted a preliminary study in which we asked eight tool users (including lawyers, philosophers, computer scientists, roboticists and human factors experts) to assess the usability and expressiveness of SLEECVAL, and to provide feedback. In this trial, the users were asked to define SLEEC requirements for autonomous agents used in their projects, e.g., autonomous cars and healthcare systems. The feedback received from these users can be summarized as follows: (1) SLEECVAL is easy to use and the language is intuitive; (2) the highlighting of keywords, error messages and warnings is particularly helpful in supporting the definition of a comprehensive and valid SLEEC specification; (3) using the FDR4 output (e.g., counterexamples) directly is useful as a preliminary solution, but more meaningful messages are required to make rule conflicts and redundancies easier to comprehend and fix.

# 4 Conclusion

We have introduced SLEECVAL, a tool for the definition and validation of normative rules for autonomous agents. SLEECVAL uses a DSL for encoding timed SLEEC requirements, and provides them with a tock-CSP semantics that is automatically calculated by SLEECVAL, as are the checks for conflicts and redundancy between rules. We also presented results from using SLEECVAL for an assistive dressing robot and a firefighter drone.

In the future, we will consider uncertainty in the agents and their environments by extending the SLEEC DSL with probability constructs. Additionally, we will develop a mechanism to annotate rules with labels that can be used to provide more insightful feedback to SLEEC experts. Finally, a systematic and comprehensive user study is also planned as future work. Our vision is to automate the whole process in Fig. 3 with a suggestive feedback loop allowing users to address validation issues within their rule sets.

# Acknowledgements

This work was funded by the Assuring Autonomy International Programme, and the UKRI project EP/V026747/1 'Trustworthy Autonomous Systems Node in Resilience'.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Towards Log Slicing

Joshua Heneage Dawes1() , Donghwan Shin1,2() , and Domenico Bianculli1()

<sup>1</sup> University of Luxembourg, Luxembourg, Luxembourg {joshua.dawes,domenico.bianculli}@uni.lu <sup>2</sup> University of Sheffield, Sheffield, UK d.shin@sheffield.ac.uk

Abstract. This short paper takes initial steps towards developing a novel approach, called log slicing, that aims to answer a practical question in the field of log analysis: Can we automatically identify log messages related to a specific message (e.g., an error message)? The basic idea behind log slicing is that we can consider how different log messages are "computationally related" to each other by looking at the corresponding logging statements in the source code. These logging statements are identified by 1) computing a backwards program slice, using as criterion the logging statement that generated a problematic log message; and 2) extending that slice to include relevant logging statements.

The paper presents a problem definition of log slicing, describes an initial approach for log slicing, and discusses a key open issue that can lead towards new research directions.

Keywords: Log · Program Analysis · Static Slicing.

### 1 Introduction

When debugging failures in software systems of various scales, the logs generated by executions of those systems are invaluable [5]. For example, given an error message recorded in a log, an engineer can diagnose the system by reviewing log messages recorded before the error occurred. However, the sheer volume of the logs (e.g., 50 GB/h [9]) makes it infeasible to review all of the log messages. Considering that not all log messages are necessarily related to each other, in this paper we lay the foundations for answering the following question: can we automatically identify log messages related to a specific message (e.g., an error message)?

A similar question for programs is already addressed by program slicing [2,14]. Using this approach, given a program composed of multiple program statements and variables, we can identify a set of program statements (i.e., a program slice) that affect the computation of specific program variables (at specific positions in the source code).

Inspired by program slicing, in this paper we take initial steps towards developing a novel approach, called log slicing. We also highlight a key issue to


```
(1) logger.info("check memory status: %s" % mem.status)
(2) db = DB.init(mode="default")
(3) logger.info("DB connected with mode: %s" % db.mode)
(4) item = getItem(db)
(5) logger.info("current item: %s" % item)
(6) if check(item) is "error":
(7)     logger.error("error in item: %s" % item)
```
Fig. 1. An example program Pex

```
(1) check memory status: okay
(2) DB connected with mode: default
(3) current item: pencil
(4) error in item: pencil
```
Fig. 2. An example execution log Lex of Pex

be addressed by further research. Once this issue has been addressed, we expect log slicing to be able to identify the log messages related to a given problematic log message by using static analysis of the code that generated the log. Further, since we will be using static analysis of source code, we highlight that our approach is likely to be restricted to identifying problems that can be localised at the source code level.

The rest of the paper is structured as follows: Section 2 illustrates a motivating example. Section 3 sketches an initial approach for log slicing, while Section 4 shows its application to the example, and discusses limitations and open issues. Section 5 discusses related work. Section 6 concludes the paper.

### 2 Motivating Example

Let us consider a simplified example program Pex (Figure 1) that connects to a database and gets an item from it. For simplicity, we denote Pex as a sequence of program statements ⟨s1, s2, ..., s7⟩, where sk is the k-th statement. We can see that Pex contains logging statements (i.e., s1, s3, s5, and s7) that will generate log messages when executed<sup>3</sup>. Figure 2 shows a simplified execution log Lex of Pex. Similarly to Pex, we denote Lex as a sequence of log messages ⟨m1, m2, m3, m4⟩, where mk is the k-th log message. Note that we do not consider additional information that is often found in logs, such as timestamps and log levels (e.g., info and debug)<sup>4</sup>, so these are omitted.

<sup>3</sup> If a program statement generates a log message when executed, it is considered a logging statement; otherwise, it is a non-logging statement.

<sup>4</sup> We ignore log levels since the user may choose a log message of any level to start log slicing.

The last log message "error in item: pencil" in Lex indicates an error. Calling this log message merr, let us suppose that a developer is tasked with addressing the error by reviewing the log messages leading up to merr. Though Lex contains only four messages, in practice it is infeasible to review the huge number of log messages generated by complex software systems. Furthermore, it is not necessary to review all log messages generated before merr, since only a subset of them is related to merr; for example, if we look at Lex and Pex together, we can see that the first log message "check memory status: okay" does not contain information that is relevant to the error message merr. In particular, we can see this by realising that the variable mem logged in the first log message does not affect the computation of the variable item logged in the error message.

Ultimately, if we can automatically filter out such unrelated messages, with the goal of providing a log to the developer that only contains useful log messages, then the developer will be able to investigate and address issues in less time. We thus arrive at the central problem of this short paper: How does one determine which log messages are related to a certain message of interest?

An initial, naive solution would be to use keywords to identify related messages. In our example log Lex, one could use the keyword "pencil" appearing in the error message to identify the messages related to the error, which would select only the third log message. However, if we look at the source code of Pex, we notice that the second log message "DB connected with mode: default" could also be relevant to the error, because this message was constructed using the variable db, which is used to compute the value of the variable item. This example highlights that keyword-based search cannot identify all relevant log messages, meaning that a more sophisticated approach to identifying relevant log messages is needed.
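A few lines of Python make the limitation concrete; the function name and log contents below simply mirror the example above and are illustrative only:

```python
def keyword_slice(log, keyword):
    """Naive filtering: keep only the messages containing the keyword."""
    return [m for m in log if keyword in m]

log = [
    "check memory status: okay",
    "DB connected with mode: default",
    "current item: pencil",
    "error in item: pencil",
]

# Searching for "pencil" finds the item messages but misses the DB
# message, even though db affects the computation of item.
related = keyword_slice(log, "pencil")
```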

### 3 Log Slicing

A key assumption in this work is that it is possible to associate each log message with a unique logging statement in source code. We highlight that, while we do not describe a solution here, this is a reasonable assumption because there is already work on identifying the mapping between logging statements and log messages [4,11]. Therefore, we simply assume that the mapping is known.

Under this assumption, we observe that the relationship among messages in the log can be identified based on the relationship among their corresponding logging statements in the source code. Hence, we consider two distinct layers: the program layer, where program statements and variables exist, and the log layer, where log messages generated by the logging statements of the program exist.

To present our log slicing approach, as done in Section 2, let us denote a program P as a sequence of program statements and a log L as a sequence of log messages. Also, we say a program (slice) P′ is a subsequence of P, denoted by P′ ⊑ P, if all statements of P′ are in P in the same order. Further, we extend containment to sequences and write s ∈ P when, with P = ⟨s1, ..., su⟩, there is some k such that sk = s. The situation is similar for a log message m contained in a log L, where we write m ∈ L. Now, for a program P = ⟨s1, ..., su⟩ and its execution log L = ⟨m1, ..., mv⟩, let us consider a log message of interest mj ∈ L that indicates a problem. An example could be the log message "error in item: pencil" from the example log Lex in Figure 2. Based on the assumption made at the beginning of this section, that we can identify the logging statement si ∈ P (in the program layer) that generated mj ∈ L (in the log layer), our log slicing approach is composed of three abstract steps as follows:


Step 1: Compute a backward program slice Sr ⊑ P, using the logging statement si (and the variables it uses) as the slicing criterion.

Step 2: Find the logging statements Sl ⊑ P whose messages are relevant to the statements in Sr, i.e., such that ⟨sl, sr⟩ ∈ relevanceP for some non-logging statement sr ∈ Sr, where relevanceP is a relation over the statements of P.

Step 3: Remove any log message m ∈ L that was not generated by some sl ∈ Sl.

The result of this procedure would be a log slice that contains the log messages relevant to mj.
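The three abstract steps can be written down as a small skeleton. The statement representation, the pretend slicer, and the demo relevance relation below are all illustrative assumptions, not part of the paper's formalism:

```python
def log_slice(program, log, generated_by, backward_slice, relevance):
    # Step 1: find the statement that generated the message of interest
    # (here: the last message) and slice the program backwards from it.
    s_i = generated_by[log[-1]]
    slice_r = backward_slice(program, s_i)
    # Step 2: keep logging statements relevant to a statement in the slice.
    s_l = [s for s in program if s["logs"]
           and any(relevance(s, r) for r in slice_r)]
    # Step 3: remove messages not generated by a relevant logging statement.
    return [m for m in log if generated_by[m] in s_l]

# Tiny demo with a pretend slicer and variable-sharing relevance.
program = [{"logs": True,  "vars": {"x"}},
           {"logs": False, "vars": {"x", "y"}},
           {"logs": True,  "vars": {"z"}},
           {"logs": True,  "vars": {"y"}}]
log = ["x is 1", "unrelated z", "error: y bad"]
generated_by = {"x is 1": program[0], "unrelated z": program[2],
                "error: y bad": program[3]}
sliced = log_slice(program, log, generated_by,
                   backward_slice=lambda p, s: [p[1], s],
                   relevance=lambda a, b: bool(a["vars"] & b["vars"]))
```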

We highlight that defining the relation relevanceP for a program P (intuitively, deciding whether the information written to a log by a logging statement is relevant to the computation being performed by some non-logging statement) is a central problem in this work, and will be discussed in more depth in the next section.

### 4 An Illustration of Log Slicing

We now illustrate the application of our log slicing procedure to our example program and log (Figures 1 and 2). Since, as we highlighted in Section 3, the definition of the relevanceP relation is a central problem of this work, we will begin by fixing a provisional definition. A demonstration of our log slicing approach being applied using this definition of relevanceP will then show why this definition is only provisional.

### 4.1 A Provisional Definition of Relevance

Our provisional definition makes use of some attributes of statements that can be computed via simple static analyses. In particular, for a statement s, we denote by vars(s) the set of variables that appear in s (where a variable x appears in a statement s if it is found in the abstract syntax tree of s). If s is a logging

<sup>5</sup> Assuming a logging statement does not call an impure function.

```
(2) db = DB.init(mode="default")
(4) item = getItem(db)
(6) if check(item) is "error":
(7)     logger.error("error in item: %s" % item)
```
Fig. 3. Program slice Sr of the program Pex when s7 and its variable item are used as the slicing criterion

statement that writes a message m to the log, then, assuming that the only way in which a logging statement can use a variable is to add information to the message that it writes to the log, the set vars(s) corresponds to the set of variables used to construct the message m. If s is a non-logging statement, then vars(s) represents the set of variables used by s.

Now, let us consider a logging statement sl that writes a message ml to the log, and a non-logging statement sr. We define relevanceP<sup>6</sup> over the statements in a program P by ⟨sl, sr⟩ ∈ relevanceP if and only if vars(sl) ∩ vars(sr) ≠ ∅. In other words, a logging statement is relevant to a non-logging statement whenever the two statements share at least one variable.
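This definition is easy to make executable. The sketch below uses Python's `ast` module to approximate vars(s) by collecting every name in a statement's syntax tree; as a caveat of this crude extraction, function and logger names are collected too (which is harmless here, since relevance compares a logging with a non-logging statement):

```python
import ast

def vars_of(statement_source):
    """All names appearing in the statement's abstract syntax tree."""
    tree = ast.parse(statement_source)
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

def relevant(s_log, s_other):
    # relevanceP: the two statements share at least one variable.
    return bool(vars_of(s_log) & vars_of(s_other))
```

For the statements of Pex, `relevant` correctly relates the error logging statement to `item = getItem(db)`, while the memory-status logging statement shares no variable with it.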

### 4.2 Applying Log Slicing

Taking the program Pex from Figure 1 and the log Lex from Figure 2, we now apply the steps described in Section 3, considering the log message m4 ∈ Lex (i.e., "error in item: pencil") to be the message of interest mj.

Step 1. Under our assumption that log messages can be mapped to their generating logging statements, we can immediately map m4 to s7 ∈ Pex. Once we have identified the logging statement s7 that generated m4, we slice Pex backwards, using s7 and its variable item as the slicing criterion. This yields the program slice Sr = ⟨s2, s4, s6, s7⟩, as shown in Figure 3.

Step 2. The program slice Sr = ⟨s2, s4, s6, s7⟩ yielded by Step 1 contains only non-logging statements (apart from the logging statement s7 used as the slicing criterion). Hence, we must now determine which logging statements (found in Pex) write messages that are relevant to the statements in Sr. More formally, we must find a sequence of logging statements Sl ⊑ Pex such that ⟨sl, sr⟩ ∈ relevanceP for any logging statement sl ∈ Sl and a non-logging statement sr ∈ Sr \ {s7}. For this, we use the provisional definition of relevance that we introduced in Section 4.1; that is, we identify the logging statements that share variables with the statements in our program slice Sr. For example, let us consider the non-logging statement sr = s2 ∈ Sr (i.e., "db = DB.init(mode="default")"). Our definition tells us that the logging statement sl = s3 (i.e., "logger.info("DB

<sup>6</sup> We remark that this simple provisional definition of relevance misses relating statements that share only syntactically different aliased variables.


```
(3) logger.info("DB connected with mode: %s" % db.mode)
(5) logger.info("current item: %s" % item)
(7) logger.error("error in item: %s" % item)
```

Fig. 4. Logging statements Sl relevant to Sr

```
(2) DB connected with mode: default
(3) current item: pencil
(4) error in item: pencil
```
Fig. 5. Log slicing result from Lex when m4 is the message of interest

connected with mode: %s" % db.mode)") should be included in Sl, since vars(s3) ∩ vars(s2) = {db}. Similarly, the logging statement s5 should be included in Sl since vars(s5) ∩ vars(s4) = {item}, and the logging statement s7 should be included in Sl since vars(s7) ∩ vars(s6) = {item}. Note that the logging statement s1 (i.e., "logger.info("check memory status: %s" % mem.status)") would be omitted by our definition, because no statements in Sr use the variable mem. As a result, with respect to our definition of relevance, Sl = ⟨s3, s5, s7⟩, as shown in Figure 4.

Step 3. Using Sl = ⟨s3, s5, s7⟩, we now remove from Lex the log messages generated by logging statements not included in Sl. The result is the sliced log in Figure 5.
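The whole run can be checked mechanically. In the sketch below, the backward slice from Step 1 is hardcoded rather than computed by a real slicer, and the variable sets are read off the statements in Figure 1:

```python
vars_of = {                      # statement number -> variables it uses
    1: {"mem"}, 2: {"db"}, 3: {"db"}, 4: {"item", "db"},
    5: {"item"}, 6: {"item"}, 7: {"item"},
}
logging = {1, 3, 5, 7}           # logging statements of Pex
generated_by = {1: 1, 2: 3, 3: 5, 4: 7}   # log message -> logging statement

S_r = [2, 4, 6, 7]               # Step 1: backward slice from s7 on item

# Step 2: logging statements sharing a variable with a non-logging
# statement in the slice.
S_l = [s for s in sorted(logging)
       if any(vars_of[s] & vars_of[r] for r in S_r if r not in logging)]

# Step 3: keep only the messages generated by statements in S_l.
log = {1: "check memory status: okay", 2: "DB connected with mode: default",
       3: "current item: pencil", 4: "error in item: pencil"}
sliced = [log[m] for m in sorted(log) if generated_by[m] in S_l]
```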

### 4.3 Limitations and Open Issues

We now discuss the limitations of the definition of relevance presented so far, along with a possible alternative approach. We also highlight a key open issue.

Limitations. Using a combination of program slicing and our provisional definition of relevance seems, at least initially, to be an improvement on the keyword-based approach described in Section 2. However, the major limitation of this definition, which looks at program variables shared by logging and non-logging statements, is that a logging statement must use variables in the first place. Hence, this definition can no longer be used if we are dealing with log messages that are statically defined (i.e., that do not use variables to construct part of the message written to the log). In this case, we must look to the semantic content of the log messages.

An Alternative. Our initial suggestion in this case is to introduce a heuristic based on the intuition that particular phrases in log messages often accompany particular computations performed in the program source code. Such a heuristic would operate as follows:


We highlight that this token-based approach is to be used in combination with the backwards program slicing described in Section 3.
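Since the heuristic's steps are only outlined above, the sketch below is one plausible instantiation under our own assumptions: tokens are phrases attached to logging statements, and a token becomes associated with a variable when statements using that variable frequently appear within a small window around logging statements containing the token. Every name, window size, and threshold here is hypothetical:

```python
from collections import Counter

def token_variable_associations(statements, window=2, min_count=2):
    """statements: list of (tokens, variables) pairs in program order.
    Logging statements carry message tokens; non-logging statements
    carry the variables they use."""
    counts = Counter()
    for i, (tokens, _) in enumerate(statements):
        lo, hi = max(0, i - window), min(len(statements), i + window + 1)
        nearby_vars = set().union(*(v for _, v in statements[lo:hi]))
        for t in tokens:
            for v in nearby_vars:
                counts[(t, v)] += 1
    # Associate a token with a variable only if they co-occur often enough.
    return {pair for pair, c in counts.items() if c >= min_count}

# "connect" repeatedly appears near statements using db, so the pair
# becomes associated; a one-off co-occurrence would be filtered out.
pairs = token_variable_associations(
    [({"connect"}, set()), (set(), {"db"}),
     ({"connect"}, set()), (set(), {"db"})])
```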

Further Limitations. While this heuristic takes a step towards inspecting the semantic content of log messages, rather than relying on shared variables, initial implementation efforts have demonstrated the following limitations:


More Issues. In Section 3, we assumed that the mapping between log messages and the corresponding logging statements that generated the log messages is known. However, determining the log message that a given logging statement might generate can be challenging, especially when the logging statement has a non-trivial structure. For example, while some logging statements might consist of a simple concatenation of a string and a variable value, others might involve nested calls of functions from a logging framework. This calls for more studies on finding the correspondence between logging statements and log messages.

Another key problem is the inconsistency of program slicing tools across programming languages (especially weakly-typed ones such as Python). If the underlying program slicing machinery makes too many overapproximations, the applicability of our proposed approach suffers. Furthermore, the limited capability of such tools to handle complex cases, such as nested function calls across different components, can hinder the success of log slicing.

### 5 Related Work

Log Analysis. The relationship between log messages has also been studied in various log analysis approaches (e.g., performance monitoring, anomaly detection, and failure diagnosis), especially for building a "reference model" [12] that represents the normal behavior (in terms of logged event flows) of the system under analysis. However, these approaches focus on the problem of identifying whether log messages co-occur (that is, one is always seen in the neighbourhood of the other) without accessing the source code [6,10,13,17,18]. On the other hand, we consider the computational relationship between log messages to filter out the log messages that do not affect the computation of the variable values recorded in a given log message of interest.

Log partitioning. Log partitioning, similarly to log slicing, involves separating a log into multiple parts based on some criteria. In the context of process mining [1], log partitioning is used to allow parallelisation of model construction. In the context of checking an event log against formal specifications [3], slices of event logs are sent to separate instances of a checking procedure, allowing more efficient checking of whether the log satisfies a specification written in a temporal logic. Hence, in both cases, log partitioning, or slicing, is used to parallelise a task. Finally, we highlight that our log slicing approach could be used to generate multiple log slices to be investigated in parallel by some procedure.

Program Analysis including Logging Statements. Traditionally, program analysis [14,2] ignores logging statements since they usually do not affect the computation of program variables. Nevertheless, program analysis including logging statements has been studied as part of log enhancement to measure which program variables should be added to the existing logging statements [7,15] and where new logging statements should be added [16] to facilitate distinguishing program execution paths. Log slicing differs in that it actively tries to reduce the contents of a log. Finally, Messaoudi et al. [8] have proposed a log-based test case slicing technique, which aims to decompose complex test cases into simpler ones using, in addition to program analysis, data available in logs.

### 6 Conclusion

In this short paper, we have taken the first steps in developing log slicing, an approach to helping software engineers in their log-based debugging activities. Log slicing starts from a log message that has been selected as indicative of a failure, and uses static analysis of the source code whose execution generated the log in question to discard log entries that are not relevant to the failure.

In giving an initial definition of the log slicing problem, we highlighted the central problem of this work: defining a good relevance relation. The provisional definition of relevance that we gave in Section 4.1 proved to be limited in that it requires logging statements to use variables when constructing their log messages. To remedy the situation, we introduced a frequency- and proximity-based heuristic in Section 4.3. While this approach could improve on the initial definition of relevance, it has various limitations, which we summarised.

Ultimately, as part of future work, we intend to investigate better definitions of relevance between logging statements and non-logging statements. If we were to carry on with the same idea for the heuristic (using frequency and proximity), future work would involve 1) finding a suitable way to define tokens; 2) reducing identification of coincidental associations between tokens and variables (i.e., reducing false positives); and 3) attempting to identify associations between tokens and variables with a lower frequency.

Acknowledgments. The research described has been carried out as part of the COSMOS Project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No. 957254.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Vamos: Middleware for Best-Effort Third-Party Monitoring

Marek Chalupa () , Fabian Muehlboeck , Stefanie Muroya Lei , and Thomas A. Henzinger

Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria marek.chalupa@ista.ac.at

Abstract. As the complexity and criticality of software increase every year, so does the importance of run-time monitoring. Third-party monitoring, with limited knowledge of the monitored software, and best-effort monitoring, which keeps pace with the monitored software, are especially valuable, yet underexplored areas of run-time monitoring. Most existing monitoring frameworks do not support their combination because they either require access to the monitored code for instrumentation purposes or the processing of all observed events, or both.

We present a middleware framework, Vamos, for the run-time monitoring of software which is explicitly designed to support third-party and best-effort scenarios. The design goals of Vamos are (i) efficiency (keeping pace at low overhead), (ii) flexibility (the ability to monitor black-box code through a variety of different event channels, and the connectability to monitors written in different specification languages), and (iii) ease-of-use. To achieve its goals, Vamos combines aspects of event broker and event recognition systems with aspects of stream processing systems. We implemented a prototype toolchain for Vamos and conducted experiments including a case study of monitoring for data races. The results indicate that Vamos enables writing useful yet efficient monitors, is compatible with a variety of event sources and monitor specifications, and simplifies key aspects of setting up a monitoring system from scratch.

# 1 Introduction

Monitoring—the run-time checking of a formal specification—is a lightweight verification technique for deployed software. Writing monitors is especially challenging if it is third-party and real-time. In third-party monitoring, the monitored software and the monitoring software are written independently, in order to increase trust in the monitor. In the extreme case, the monitor has very limited knowledge of and access to the monitored software, as in black-box monitoring. In real-time monitoring, the monitor must not slow down the monitored software while also following its execution close in time. In the extreme case, the monitor may not be able to process all observed events and can check the specification only approximately, as in best-effort monitoring.

We present middleware—called Vamos ("Vigilant Algorithmic Monitoring of Software")—which facilitates the addition of best-effort third-party monitors to deployed software. The primary goals of our middleware are (i) performance (keeping pace at low overhead), (ii) flexibility (compatibility with a wide range of heterogeneous event sources that connect the monitor with the monitored software, and with a wide range of formal specification languages that can be compiled into Vamos), and (iii) ease-of-use (the middleware relieves the designer of the monitor from system and code instrumentation concerns).

All of these goals are fairly standard, but Vamos' particular design tradeoffs center around making it as easy as possible to create a best-effort third-party monitor of actual software without investing much time into low-level details of instrumentation or load management. In practice, instrumentation—enriching the monitored system with code that is gathering observations on whose basis the monitor generates verdicts—is a key part of writing a monitoring system and affects key performance characteristics of the monitoring setup [11]. These considerations become even more important in third-party monitoring, where the limited knowledge of and access to the monitored software may force the monitor to spend more computational effort to re-derive information that it could not observe, or combine it from smaller pieces obtained from more (and different) sources. By contrast, current implementations of monitor specification languages mostly offer either very targeted instrumentation support for particular systems or some general-purpose API to receive events, or both, but little to organize multiple heterogeneous event streams, or to help with the kinds of best-effort performance considerations that we are concerned with. Thus, Vamos fills a gap left open by existing tools.

Our vision for Vamos is that users writing a best-effort third-party monitor start by selecting configurable instrumentation tools from a rich collection. This collection includes tools that periodically query webservices, generate events for relevant system calls, observe the interactions of web servers with clients, and of course standard code instrumentation tools. The configuration effort for each such event source largely consists of specifying patterns to look for and what events to generate for them. Vamos then offers a simple specification language for filtering and altering events coming from the event sources, and simple yet expressive event recognition rules that produce a single, global event stream by combining events from a (possibly dynamically changing) number of event sources. Lastly, monitoring code as it is more generally understood—which could be written directly or generated from existing tools for run-time verification like LTL formulae [47], or stream verification specifications [8] such as TeSSLa [41]—processes these events to generate verdicts about the monitored system.

Vamos thus represents middleware between event sources that emit events and higher-level monitoring code, abstracting away many low-level details about the interaction between the two. Users can employ both semi-synchronous and completely asynchronous [11] interactions with any or all event sources. Between these two extremes, to decouple the higher-level monitoring code's performance from the overhead incurred by the instrumentation, while putting a bound on how far the monitoring code can lag behind the monitored system, we provide a simple load-shedding mechanism that we call autodrop buffers, which are buffers that drop events when the monitoring code cannot keep up with the rate of incoming events, while maintaining summarization data about the dropped events. This summarization data can later be used by our event recognition system when it is notified that events were dropped; some standard monitoring specification systems can handle such holes in their event streams automatically [32,42,54]. The rule-based event recognition system allows grouping and ordering buffers dynamically to prioritize or rotate within variable sets of similar event sources, and specifying patterns over multiple events and buffers, to extract and combine the necessary information for a single global event stream.
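The autodrop idea can be sketched in a few lines. The buffer below is a single-threaded illustration only (the real implementation uses lock-free shared-memory buffers), and the `("hole", n)` record is our stand-in for a notification that n events were dropped:

```python
from collections import deque

class AutodropBuffer:
    """Bounded buffer that drops events when full, keeping a summary
    (here just a count) of what was dropped."""

    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity
        self.dropped = 0          # summarization data about dropped events

    def push(self, event):
        if len(self.buf) >= self.capacity:
            self.dropped += 1     # shed load instead of blocking the source
        else:
            self.buf.append(event)

    def pop(self):
        """Next buffered event; a ("hole", n) record once the buffered
        events are drained and n events were dropped; None when empty."""
        if self.buf:
            return self.buf.popleft()
        if self.dropped:
            n, self.dropped = self.dropped, 0
            return ("hole", n)
        return None
```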

Data from event sources is transferred to the monitor using efficient lock-free buffers in shared memory inspired by Cache-Friendly Asymmetric Buffers [29]. These buffers can transfer over one million events per second per event source on a standard desktop computer. Together with autodrop buffers, this satisfies our performance goal while keeping the specification effort low. As such, Vamos resembles a single-consumer version of an event broker [18,58,48,55,26,1] specialized to run-time monitoring.

The core features we built Vamos around are not novel on their own, but to the best of our knowledge, their combination and application to simplify best-effort third-party monitoring setups is. Thus, we make the following contributions:


# 2 Architectural Overview

Writing a run-time monitor can be a complex task, but many tools to express logical reasoning over streams of run-time observations [19,34,16,49,24,27,41] exist. However, trying to actually obtain a concrete stream of observations from a real system introduces a very different set of concerns, which in turn have a huge effect on the performance properties of run-time monitoring [11].

The goal of Vamos is to simplify this critical part of setting up a monitoring system, using the model shown in Figure 1. On the left side, we assume an arbitrary number of distinct event sources directly connected to the monitor. This is particularly important in third-party monitoring, as information may need to be collected from multiple different sources instead of just a single program, but can be also useful in other monitoring scenarios, e.g. for multithreaded programs.

Fig. 1. The components of a Vamos setup.

The right-most component is called the monitor, representing the part of the monitoring system that is typically generated by a monitoring specification tool, usually based on a single global event stream. As middleware, Vamos connects the two, providing abstractions for common issues that monitor writers would otherwise have to address with boilerplate, but still complicated code.

Given that there are multiple event sources providing their own event streams, but only one global event stream consumed by the monitor, a key aspect is merging the incoming streams into one, which happens in the arbiter. Third-party monitoring often cannot rely on the source-code-based instrumentation that is otherwise common [21,4,14,16,25]; for example, TeSSLa<sup>1</sup> [41] comes with a basic way of instrumenting C programs by adding annotations into the specification that identify events with function calls or their arguments. Instead, it has to rely on things that can be reliably observed and whose meaning is clear, for example system calls, calls to certain standard library functions, or any other information one can gather from parts of the environment one controls, such as sensors or file system. These do not necessarily correspond in a straightforward way to the events one would like to feed into the higher-level monitor and thus need to be combined or split up in various ways. For example, when a program writes a line to the standard output, the data itself might be split into multiple system calls or just be part of a bigger one that contains multiple lines, and there are also multiple system calls that could be used. Therefore, the arbiter provides a way to specify a rule-based event recognition system to generate higher-level events from combinations of events on the different event sources.

Another common assumption in monitoring systems is some global notion of time that can be used to order events. This is not necessarily true for multiple, heterogeneous event sources, and even just observing the events of a multithreaded program can cause events to arrive in an order that does not represent causality. Vamos arbiter specifications are flexible enough to support many user-defined ways of merging events into a single global stream.
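One simple merging policy can be pictured with a small sketch. The Python below is purely illustrative (the `merge_streams` helper and the `(timestamp, payload)` event shape are our own assumptions, not part of Vamos): when every event happens to carry a comparable timestamp, building the global stream amounts to a k-way merge of the per-source streams.

```python
import heapq

def merge_streams(streams):
    """Merge several per-source event lists into one global stream.

    Each event is a (timestamp, payload) pair, and each source's list is
    already ordered by timestamp. This mimics one possible arbiter policy;
    real Vamos arbiter rules are user-defined and need not assume that
    timestamps exist at all.
    """
    return list(heapq.merge(*streams, key=lambda ev: ev[0]))

# Two sources whose events arrived interleaved out of global order:
source_a = [(1, "open"), (4, "close")]
source_b = [(2, "read"), (3, "write")]
global_stream = merge_streams([source_a, source_b])
```

When no such timestamp exists, the arbiter has to fall back on domain knowledge (as in the Primes example later), which is why Vamos leaves the merging policy entirely to the specification.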

Doing this kind of sorting and merging and then potentially arbitrarily complex other computations in both the arbiter and the monitor may take longer than it takes the monitored system to generate events. Especially in third-party monitoring, a monitor may have to reconstruct information that is technically

<sup>1</sup> We keep referring to TeSSLa in the rest of the paper and also chose to use it in our implementation because it is one of the most easily available existing tools we could find. In general, the state of the field is that, while many papers describing similar tools exist, few are actually available [48].

```
1 stream type Observation { Op(arg : int, ret : int); }
2 event source Program : Observation to autodrop(16,4)
3 arbiter : Observation {
4   on Program: hole(n) | ;
5   on Program: Op(arg, ret) | yield;
6 }
7 monitor(2) { on Op(arg, ret) $$ CheckOp(arg, ret); $$ }
```
Listing 1.1. A basic asynchronous best-effort monitor.
present in the monitored system but cannot be observed, or, worse, the monitor may have to consider multiple different possibilities if information cannot be reliably recomputed. However, as part of our performance goal, we want the monitor to not lag too far behind the monitored system. Therefore, our design splits the monitoring system into the performance and correctness layers. In between the two, events may be dropped as a simple load-shedding strategy.

The performance layer, on the other hand, sees all events and processes each event stream in parallel. Stream processors enable filtering and altering the events that come in, reducing pressure and computational load on the correctness layer. This reflects that in third-party monitoring, observing coarse-grained event types like system calls may yield many uninteresting events. For example, all calls to read may be instrumented, but only certain arguments make them interesting.

A Simple Example Listing 1.1 shows a full Vamos specification (aside from the definition of custom monitoring code in a C function called CheckOp). Stream types describe the kinds of events and the memory layout of their data that can appear in a particular buffer; in this example, streams of type Observation contain only one possible event named Op with two fields of type **int**. For source buffers—created using event source descriptions as in line 2—these need to be based on the specification of the particular event source. Each event source is associated with a stream processor; if none is given (as in this example), a default one simply forwards all events to the corresponding arbiter buffer, here specified as an autodrop buffer that can hold up to 16 events and when full keeps dropping them until there is again space for at least four new events. Using an autodrop buffer means that in addition to the events of the stream type, the arbiter may see a special hole event notifying it that events were dropped. In this example, the arbiter simply ignores those events and forwards all others to the monitor, which runs in parallel to the arbiter with a blocking event queue of size two, and whose behavior we implemented directly in C code between \$\$ escape characters.

### 3 Efficient Instrumentation

Our goals for the performance of the monitor are to not incur too much overhead on the monitored system, and for the monitor to be reasonably up-to-date in terms of the lag between when an event is generated and when it is processed. The key features Vamos offers to ensure these properties while keeping specifications simple are related to the performance layer, which we discuss here.

### 3.1 Source Buffers and Stream Processors

Even when instrumenting things like system calls, in order to extract information from them in a consistent state, the monitored system will have to be briefly interrupted while the instrumentation copies the relevant data. A common solution is to write this data to a log file that the monitor is incrementally processing. This approach has several downsides. First, in the presence of multiple threads, accesses to a single file require synchronization. Second, the common use of string encodings requires extra serialization and parsing steps. Third, file-based buffers are typically very large or even unbounded in size, so slower monitors eventually exhaust system resources. Finally, writing to the log uses relatively costly system calls. Instead, Vamos event sources transmit raw binary data via channels implemented as limited-size lock-free ring buffers in shared memory, limiting instrumentation overhead and optimizing throughput [29]. To avoid expensive synchronization of different threads in the instrumented program (or just to logically separate events), Vamos also allows dynamically allocating new event sources, such that each thread can write to its own buffer(s). The total number of event sources may therefore vary across the run of the monitor.
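The index arithmetic behind such a bounded single-producer/single-consumer channel can be sketched as follows. This Python class is only an illustration of the full/empty logic (the actual Vamos buffers are lock-free C ring buffers in shared memory; the names here are ours):

```python
class SPSCRing:
    """Bounded single-producer/single-consumer ring buffer sketch.

    The producer fails (rather than blocks) when the buffer is full,
    so the instrumentation can decide whether to wait or drop.
    """
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0  # index of the next slot to read
        self.tail = 0  # index of the next slot to write

    def push(self, item):
        if self.tail - self.head == len(self.buf):
            return False  # full: caller must wait or shed the event
        self.buf[self.tail % len(self.buf)] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # empty
        item = self.buf[self.head % len(self.buf)]
        self.head += 1
        return item
```

Because head and tail only ever advance, a real implementation needs no locks for one producer and one consumer, only appropriately ordered atomic loads and stores of the two counters.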

For each event source, Vamos allocates a new thread in the performance layer to process events from this source<sup>2</sup> . In this layer, event processors can filter and alter events before they are forwarded to the correctness layer, all in a highly parallel fashion. A default event processor simply forwards all events. The computations done here should be done at the speed at which events are generated on that particular source, otherwise the source buffer will fill up and eventually force the instrumentation to wait for space in the buffer.
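A performance-layer processor of this kind boils down to filter-then-forward. As a rough Python sketch (the `(kind, fd, data)` event shape and the `stream_processor` function are hypothetical stand-ins for a compiled Vamos stream processor rule):

```python
def stream_processor(source_events, forward):
    """Filter-then-forward sketch of a performance-layer processor.

    Mirrors the paper's example of instrumented read calls where only
    certain arguments are interesting: here we forward read events only
    for file descriptor 0, shedding the rest before the correctness layer.
    """
    for ev in source_events:
        kind, fd, data = ev
        if kind == "read" and fd != 0:
            continue  # uninteresting read: drop it early and cheaply
        forward(ev)
```

Keeping this logic cheap is what lets each per-source thread keep pace with its event source, as the surrounding text requires.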

### 3.2 Autodrop Buffers

As we already stated, not all computations of a monitor may be able to keep up with the monitored system. Our design separates these kinds of computations into the correctness layer, which is connected with the performance layer via arbiter buffers. The separation is achieved by using autodrop buffers. These buffers provide the most straightforward form of load management via load shedding [59]: if there is not enough space in the buffer, it drops the events forwarded to it while gathering summarization information (such as the count of events dropped since the buffer became full). Once free space becomes available in the buffer, it automatically inserts a special hole event containing the summarization information. The summarization ensures that not all information about dropped

<sup>2</sup> When event sources can be dynamically added, the user may specify a limit to how many of them can exist concurrently to avoid accumulating buffers the monitor cannot process fast enough. When that limit is hit, new event sources are rejected and the instrumentation drops events that would be forwarded to them.

events is lost, which can help to reduce the impact of load shedding. At minimum, the existence of the hole event alone makes a difference in monitorability compared to not knowing whether any events have been lost [35], and is used as such in some monitoring systems [32,42,54].

In addition to autodrop buffers, arbiter buffers can also be finite-size buffers that block when space is not available, or infinite-size buffers. The former may slow down the stream processor and ultimately the event source, while the latter may accumulate data and exhaust available resources. For some event sources, this may not be a big risk, and it eliminates the need to deal with hole events.
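The behavior of an autodrop buffer such as the `autodrop(16,4)` in Listing 1.1 can be sketched as follows. The Python class is our illustrative reconstruction, not Vamos code, and it assumes `refill >= 2` so that the hole event plus the next real event always fit:

```python
class AutodropBuffer:
    """Autodrop buffer sketch: when full, drop incoming events while
    counting them; once at least `refill` slots are free again, insert
    a hole event summarizing the losses and resume accepting events."""
    def __init__(self, capacity, refill):
        self.capacity, self.refill = capacity, refill
        self.items = []
        self.dropped = 0  # events dropped since the buffer became full

    def push(self, ev):
        if self.dropped:  # currently shedding load
            if self.capacity - len(self.items) >= self.refill:
                self.items.append(("hole", self.dropped))  # summarization
                self.dropped = 0
                self.items.append(ev)
            else:
                self.dropped += 1
        elif len(self.items) < self.capacity:
            self.items.append(ev)
        else:
            self.dropped = 1  # buffer just became full: start dropping

    def pop(self):
        return self.items.pop(0) if self.items else None
```

Note that the consumer (the arbiter) sees the hole exactly where events went missing in the stream, which is what makes the loss observable rather than silent.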

# 4 Event Recognition, Ordering, and Prioritization

Vamos' arbiter specifications are a flexible, yet simple way to organize the information gathered from a—potentially variable—number of heterogeneous event sources. In this section, we discuss the key relevant parts of such specifications—a more complete specification can be found in the Technical Report [13].

### 4.1 Arbiter Rules

We already saw simple arbiter rules in Listing 1.1, but arbiter rules can be much more complex, specifying arbitrary sequences of events at the front of arbitrarily many buffers, as well as buffer properties such as a minimum number of available events and emptiness. Firing a rule can also be conditioned by an arbitrary boolean expression. For example, one rule in the Bank example we use in our evaluation in Section 6 looks as follows:

```
on Out : transfer(t2, src, tgt) transferSuccess(t4) |,
   In  : numIn(t0, act) numIn(t1, acc) numIn(t3, amnt) |
where $$ t2 == t0 + 4 $$
$$ $yield SawTransfer(src, tgt, amnt); ... $$
```
This rule matches multiple events on two different buffers (In and Out), describing a series of user input and program output events that together form a single higher-level event SawTransfer, which is forwarded to the monitor component of the correctness layer. Rules do not necessarily consume the events they have looked at; some events may also just serve as a kind of lookahead. The "|" character in the event sequence pattern separates the consumed events (left) from the lookahead (right). Code between \$\$ symbols can be arbitrary C code with some special constructs, such as the **\$yield** statement (to forward events) above.
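The consume-versus-lookahead split can be illustrated with a small matcher. This Python sketch is a simplification (the `match_rule` function, the list-of-tuples buffer shape, and matching on event names only are all our own assumptions):

```python
def match_rule(buffer, consumed, lookahead):
    """Fire a rule if `buffer` starts with the event names in `consumed`
    followed by those in `lookahead`. Only the consumed prefix is removed
    from the buffer -- lookahead events stay, as to the right of '|' in a
    Vamos rule. Returns the consumed events, or None if the rule does not
    match."""
    pattern = consumed + lookahead
    names = [ev[0] for ev in buffer[:len(pattern)]]
    if names != pattern:
        return None  # not enough events, or the wrong ones
    matched = buffer[:len(consumed)]
    del buffer[:len(consumed)]
    return matched
```

Leaving the lookahead events in place lets a later rule (or the same rule on its next firing) consume them once their own context has arrived.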

The rule above demonstrates the basic event-recognition capabilities of arbiters. By ordering the rules in a certain way, we can also prioritize processing events from some buffers over others. Rules can also be grouped into rule sets that a monitor can explicitly switch between in the style of an automaton.

#### 4.2 Buffer Groups

The rules shown so far only refer to arbiter buffers associated with specific, named event sources. As we mentioned before, Vamos also supports creating event sources dynamically during the run of the monitoring system. To be able to refer to these in arbiter rules, we use an abstraction we call buffer groups.

As the name suggests, buffer groups are collections of arbiter buffers whose membership can change at run time. They are the only way in which the arbiter can access dynamically created event sources, so to allow a user to distinguish between them and manage associated data, we extend stream types with stream fields that can be read and updated by arbiter code. Buffer groups are declared for a specific stream type, and their members have to have that stream type<sup>3</sup> . Therefore, each member offers the same stream fields, which we can use to compare buffers and order them for the purposes of iterating through the buffer group. Now the arbiter rules can also be choice blocks with more rules nested within them, as follows (Both is a buffer group and pos is a stream field):

```
choose F,S from Both {
  on F : Prime(n,p) | where $$ $F.pos < $S.pos $$
    $$ ... $$
  on F : hole(n) |
    $$ $F.pos = $F.pos + n; $$
}
```
This rule is a slightly simplified version of one in the Primes example in Section 6. This example does not use dynamically created buffers, but only has two event sources, and uses the ordering capabilities of buffer groups to prioritize between the buffers based on which one is currently "behind" (expressed in the stream field pos, which the buffer group Both is ordered by). The choose rule tries to instantiate its variables with distinct members from the buffer group, trying out permutations in the lexicographic extension of the order specified for the buffer group. If no nested rule matches for a particular instantiation, the next one in order is tried, and the choose rule itself fails if no instantiation finds a match.
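The instantiation strategy can be sketched as follows. In this Python illustration, `choose`, the dictionary-shaped buffers, and the rule callback are hypothetical stand-ins for the compiled form of a Vamos `choose` block:

```python
from itertools import permutations

def choose(group, k, order_field, try_rules):
    """Try to bind k distinct buffers from `group`, enumerating
    permutations in the lexicographic extension of the group's order
    (here: ascending `order_field`). The first instantiation for which
    `try_rules` fires wins; the block fails if none does."""
    ordered = sorted(group, key=lambda b: b[order_field])
    for candidate in permutations(ordered, k):
        result = try_rules(*candidate)
        if result is not None:
            return result
    return None  # no instantiation matched any nested rule
```

With the `pos` ordering from the example above, the buffer that is currently "behind" is tried in the first position first, which is exactly the prioritization the Primes arbiter relies on.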

To handle dynamically created event sources, corresponding stream processor rules specify a buffer group to which to add new event sources, upon which the arbiter can access them through choose rules. In most cases, we expect that choose blocks are used to instantiate a single buffer, in which case we only need to scan the buffer group in its specified order. Here, a round-robin order allows for fairness, while field-based orderings allow more detailed control over buffer prioritization, as it may be useful to focus on a few buffers at the expense of others, as in our above example.

Another potential option for ordering schemes for buffer groups could be based on events waiting in them, or even the values of those events' associated data. Vamos currently does not support this because it makes sorting much more

<sup>3</sup> Note that stream processors may change the stream type between the source buffer and arbiter buffer, so event sources may use different types, but their arbiter buffers may be grouped together if processed accordingly.

expensive—essentially, all buffers may have to be checked in order to determine the order in which to try matching them against further rules. Some of our experiments could have made use of such a feature, but in different ways—future work may add mechanisms that capture some of these ways.

# 5 Implementation

In this section, we briefly review the key components of our implementation.

### 5.1 Source Buffers and Event Sources

The source buffer library allows low-overhead interprocess communication between a monitored system and the monitor. It implements lock-free asynchronous ring buffers in shared memory, inspired by Cache-Friendly Asymmetric Buffering [29], but extended to handle entries larger than 64 bits<sup>4</sup> . The library allows setting up an arbitrary number of source buffers with a unique name, which a monitor can connect to explicitly, and informing such connected monitors about dynamically created buffers. A user can also provide stream type information so connecting monitors can check for binary compatibility.

We have used the above library to implement an initial library of event sources: one that is used for detecting data races, and several which use either DynamoRIO [9] (a dynamic instrumentation framework) or the eBPF subsystem of the Linux Kernel [10,28,50] to intercept the read and write (or any other) system calls of an arbitrary program, or to read and parse data from file descriptors. The read/write related tools allow specifying an arbitrary number of regular expressions that are matched against the traced data, and associated event constructors that refer to parts of the regular expressions from which to extract the relevant data. Example uses of these tools are included in our artifact [12].

### 5.2 The Vamos Compiler and the TeSSLa Connector

The compiler takes a Vamos specification described in the previous sections and turns it into a C program. It does some minimal checking, for example whether events used in parts of the program correspond to the expected stream types, but otherwise defers type-checking to the C compiler. The generated program expects a command-line argument for each specified event source, providing the name of the source buffer created by whatever actual event source is used. Event sources signal when they are finished, and the monitor stops once all event sources are finished and all events have been processed.

The default way of using TeSSLa for online monitoring is to run an offline monitor incrementally on a log file of serialized event data from a single global

<sup>4</sup> Entries have the size of the largest event consisting of its fixed-size fields and identifiers for variable-sized data (strings) transported in separately managed memory.

event source. A recent version of TeSSLa [33] allows generating Rust code for the stream processing system with an interface to provide events and drive the stream processing directly. Our compiler can generate the necessary bridging code and replace the monitor component in Vamos with a TeSSLa Rust monitor. We used TeSSLa as a representative of higher-level monitoring specification tools; in principle, one could similarly use other standard monitor specification languages, thus making it easier to connect them to arbitrary event sources.

### 6 Evaluation

Our stated design goals for Vamos were (i) performance, (ii) flexibility, and (iii) ease-of-use. Of these, only the first is truly quantitative, and the majority of this section is devoted to various aspects of it. We present a number of benchmark programs, each of which used Vamos to retrieve events from different event sources and organize them for a higher-level monitor in a different way, which provides some qualitative evidence for its flexibility. Finally, we present a case study to build a best-effort data-race monitor (Section 6.4), whose relative simplicity provides qualitative evidence for Vamos' ease of use.

In evaluating performance, we focus on two critical metrics: the overhead that event generation and monitoring impose on the monitored system, and how many errors (or events) the best-effort monitor still catches given that events may be dropped.

Our core claim is that Vamos allows building useful best-effort third-party monitors for programs that generate hundreds of thousands of events per second without a significant slowdown of the programs beyond the unavoidable cost of generating events themselves. We provide evidence that corroborates this claim based on three artificial benchmarks that vary various parameters and one case study implementation of a data race monitor that we test on 391 benchmarks taken from SV-COMP 2022 [7].

Experimental setup All experiments were run on a common personal computer with 16 GB of RAM and an Intel(R) Core(TM) i7-8700 CPU with 6 physical cores running at 3.20 GHz. Hyper-Threading was enabled and dynamic frequency scaling disabled. The operating system was Ubuntu 20.04. All provided numbers are based on at least 10 runs of the relevant experiments.

#### 6.1 Scalability Tests

Our first experiment is meant to establish the basic capabilities of our arbiter implementation. An event source sends 10 million events carrying a single 64-bit number (plus 128 bits of metadata), waiting for some number of cycles between

Fig. 2. The percentage of events that reached the final stage of the monitor in a stress test where the source sends events rapidly. Parameters are different arbiter buffer sizes (x-axis) and the delay (Waiting) of how many empty cycles the source waits between sending individual events. The shading around lines shows the 95 % confidence interval around the mean of the measured value. The source buffer was 8 pages large, which corresponds to a bit over 1 300 events.

each event. The performance layer simply forwards the events to autodrop buffers of a certain size, the arbiter retrieves the events, including holes, and forwards them to the monitor, which keeps track of how many events it saw and how many were dropped. We varied the number of cycles and the arbiter buffer sizes to see how many events get dropped because the arbiter cannot process them fast enough—the results can be seen in Figure 2.

At about 70 cycles of waiting time, almost all events could be processed even with very small arbiter buffer sizes (4 and up). In our test environment, this corresponds to a delay of roughly 700 ns between events, which means that Vamos is able to transmit approximately 1.4 million events per second.

### 6.2 Primes

As a stress-test where the monitor actually has some work to do, this benchmark compares two parallel runs of a program that generates streams of primes and prints them to the standard output, simulating a form of differential monitoring [45]. The task of the monitor is to compare their output and alert the user whenever the two programs generate different data. Each output line is of the form #n : p, indicating that p is the nth prime. This is easy to parse using regular expressions, and our DynamoRIO-based instrumentation tool simply yields events with two 32-bit integer data fields (n and p).

While being started at roughly the same time, the programs as event sources run independently of each other, and scheduling differences can cause them to run out of sync quickly. To account for this, a Vamos specification needs to allocate large enough buffers to either keep enough events to make up for possible scheduling differences, or at least enough events to make it likely that there is

Fig. 3. Overheads (left) and percentage of found errors (right) in the primes benchmark for various numbers of primes and arbiter buffer sizes relative to DynamoRIO-optimized but not instrumented runs. DynamoRIO was able to optimize the program so much that the native binary runs slower than the instrumented one.

some overlap between the parts of the two event streams that are not automatically dropped. The arbiter uses the event field for the index variable n to line up events from both streams, exploiting the buffer group ordering functionality described in Section 4.2 to preferentially look at the buffer that is "behind", but also allowing the faster buffer to cache a limited number of events while waiting for events to show up on the other one. Once it has both results for the same index, the arbiter forwards a single pair event to the monitor to compare them.

Figure 3 shows results of running this benchmark in 16 versions, generating between 10 000 and 40 000 primes with arbiter buffer sizes ranging between 128 and 2048 events. The overheads of running the monitor are small, do not differ between different arbiter buffer sizes, and longer runs amortize the initial cost of dynamic instrumentation. We created a setting where one of the programs generates a faulty prime about once every 10 events and measured how many of these discrepancies the monitor can find (which depends on how many events are dropped). Unsurprisingly, larger buffer sizes are better at balancing out the scheduling differences that let the programs get out of sync. As long as the programs run at the same speed, there should be a finite arbiter buffer size that counters the desynchronization. In these experiments, this size is 512 elements.

Primes with TeSSLa We experimented with a variation of the benchmark that uses a very simple TeSSLa [17,41] specification receiving two streams for each prime generator (i.e., four streams in total): one stream of indexed primes as in the original experiment, and the other with hole events. The specification expects the streams to be perfectly lined up and checks that, whenever the last-seen pairs on both streams have the same index, they also contain the same prime (and ignores non-aligned pairs of primes). We wrote three variants of an arbiter to go in front of that TeSSLa monitor:

Fig. 4. Percentage of primes checked and errors found (of 40 000 events in total) by the TeSSLa monitor for different arbiter specifications and arbiter buffer sizes.


Figure 4 shows the impact of these different arbiter designs on how well the monitor is able to do its task, and that indeed more active arbiters yield better results—without them, the streams are perfectly aligned less than 1% of the time. While one could write similar functionality to align different, unsynchronized streams in TeSSLa directly, the language does not easily support this. As such, a combination of TeSSLa and Vamos allows simpler specifications in a higher-level monitoring language, dealing with the correct ordering and preprocessing of events on the middleware level.

### 6.3 Bank

In this classic verification scenario, we wrote an interactive console application simulating a banking interface. Users can check bank account balances, and deposit, withdraw, or transfer money to and from various accounts. The condition we want to check is that no operations should be permitted that would allow an account balance to end up below 0.

We use an event source that employs DynamoRIO [9] to dynamically instrument the program to capture its inputs and outputs, which it parses to forward the relevant information to the monitor. The monitor in turn starts out with no knowledge about any of the account balances (and resets any gathered knowledge when hole events indicate that some information was lost), but discovers them through some of the observations it makes: the result of a check balance operation gives precise knowledge about an account's balance, while the success or failure of the deposit/withdraw/transfer operations provides lower and upper bounds on the potential balances. For example, if a withdrawal of some amount

Fig. 5. Results of monitoring a simple banking simulator with Vamos monitor (left) and TeSSLa monitor (right). Boxplots show the difference in the number of reported errors versus the number of errors the application made, in percent.

fails, this amount provides an upper bound on an account's balance, and any higher successive withdrawal attempt must surely fail too.
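The bound tracking for a single account can be sketched as follows. This Python class is our illustrative reconstruction (the class and method names are ours, and it assumes integer amounts and a non-negative starting balance), not the monitor's actual code:

```python
class BalanceTracker:
    """Track the interval of possible balances for one account.

    A checked balance pins the exact value; a failed withdrawal of x
    implies balance < x; a successful withdrawal of x implies the
    balance was at least x before it ran. The monitor reports an error
    whenever an observation contradicts the known interval."""
    def __init__(self):
        self.lo, self.hi = 0, float("inf")

    def consistent(self):
        return self.lo <= self.hi

    def saw_balance(self, value):
        self.lo = self.hi = value
        return self.consistent()

    def withdraw_failed(self, amount):
        self.hi = min(self.hi, amount - 1)  # balance must be below amount
        return self.consistent()

    def withdraw_ok(self, amount):
        self.lo = max(self.lo, amount)      # balance was at least amount
        ok = self.consistent()
        # afterwards the (unknown) balance shrinks by the withdrawn amount
        self.lo = max(self.lo - amount, 0)
        self.hi -= amount
        return ok
```

In particular, after a failed withdrawal of 30, a reported successful withdrawal of 40 immediately makes the interval empty, matching the observation above that any higher withdrawal attempt must fail.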

In the spirit of third-party monitoring, however, the stateful interface does not necessarily make it easy to derive these higher-level events. For example, there is no individual confirmation that says that the withdrawal of some amount from some account was successful or not. Instead, the user selects an account, then the withdraw action, is then prompted for the amount they would like to withdraw from said account, and after entering said amount, the system only displays a message that the withdrawal failed or was successful. The event source parses each individual step and provides them on two separate streams, one for the inputs and one for the outputs. This is where Vamos' higher-level event recognition capabilities (see also the example in Section 4.1) allow the arbiter to recognize the higher-level events to forward to the monitor, which itself is therefore again much easier to specify.

To conduct measurements, we randomly generated 10 000 (well-formed) inputs and fed them to the banking application as fast as possible. We also let the application generate erroneous outputs (wrong balances, swapping success and failure messages) at random and measured how many of those our best-effort third-party monitor was able to detect. The size of the source buffer was one page (128 events) and we varied the size of arbiter buffers from 4 to 2048.

The heavyweight instrumentation we used in this scenario caused the banking application to run through its script about 40% slower than without instrumentation for all sizes of the arbiter buffer, which is more than in our other benchmarks, but seems still plausible for interactive programs, and could be much more optimized. Our second metric is how many errors the monitor actually detects. Figure 5 shows this for both the monitor we described above and a TeSSLa variant that only considers exact knowledge about account balances (no upper or lower bounds) and thus finds fewer errors, demonstrating both an alternate monitor design and the use of our TeSSLa connector. The results vary quite a bit with arbiter buffer sizes and between runs, and the monitor may report more errors than were inserted into the run. This is because, first, especially with smaller buffer sizes, the autodrop buffers may drop a significant portion (up to 60% at arbiter buffer size 4, 5% at size 256) of the events, but the monitor needs to see a contiguous chunk of inputs and outputs to be able to gather enough information to find inconsistencies. Second, some errors cause multiple inconsistencies: when a transfer between accounts is misreported as successful or failed when the opposite is true, the balances (or bounds) of two accounts are wrong. Overall, both versions of the monitor were able to find errors with even smaller sizes of arbiter buffers, and increasing buffer sizes improved the results steadily, matching the expected properties of a best-effort third-party monitor.

### 6.4 Case Study: Data Race Detection

While our other benchmarks were written artificially, we also used Vamos to develop a best-effort data race monitor. Most tools for dynamic data race detection use some variation of the Eraser algorithm [51]: obtain a single global sequence of synchronization operations and memory accesses, and use the former to establish happens-before relationships whenever two threads access the same memory location in a potentially conflicting way. This entails keeping track of the last accessing threads for each location, as well as of the ways in which any two threads might have synchronized since those last accesses. Implemented naïvely, every memory access causes the monitor to pause the thread and atomically update the global synchronization state. Over a decade of engineering efforts directed at tools like ThreadSanitizer [52] and Helgrind [57] have reduced the resulting overhead, but it can still be substantial.

Vamos enabled us to develop a similar monitor at significantly reduced engineering effort in a key area: efficiently communicating events to a monitor running in parallel in its own process, and building the global sequence of events. To build our monitor, we used ThreadSanitizer's source-code-based approach<sup>5</sup> to instrument relevant code locations, and for each such location, we reduce the need for global synchronization to fetching a timestamp from an atomically increased counter. Based on our facilities for dynamically creating event sources, each thread forms its own event source to which it sends events. In the correctness layer, the arbiter builds the single global stream of events used by our implementation of a version of the Goldilocks [22] algorithm (a variant of Eraser [51]), using the timestamps to make sure events are processed in the right order. Autodrop buffers may drop some events to avoid overloading the monitor; when this happens to a thread, we only report data races that the algorithm finds if all involved events were generated after the last time that events were dropped. This means that our tool may not find some races, often those that can only be detected looking at longer traces. However, it still found many races in our experiments, and other approaches to detecting data races in best-effort ways have similar restrictions [56].
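To make the division of labor concrete, here is a minimal happens-before checker over the resulting global stream. This Python sketch uses plain vector clocks rather than the Goldilocks algorithm the case study actually implements, handles only writes, and treats lock release/acquire as the only synchronization; every name in it is our own:

```python
class HappensBeforeChecker:
    """Minimal vector-clock race check over a single global event stream."""
    def __init__(self, nthreads):
        self.vc = [[0] * nthreads for _ in range(nthreads)]
        for t in range(nthreads):
            self.vc[t][t] = 1      # each thread starts with its own tick
        self.lock_vc = {}          # lock name -> clock at last release
        self.last_write = {}       # location -> (thread, clock snapshot)

    def release(self, t, lock):
        self.lock_vc[lock] = list(self.vc[t])
        self.vc[t][t] += 1

    def acquire(self, t, lock):
        if lock in self.lock_vc:   # join the releaser's clock into ours
            self.vc[t] = [max(a, b)
                          for a, b in zip(self.vc[t], self.lock_vc[lock])]
        self.vc[t][t] += 1

    def write(self, t, loc):
        """Process a write event; return True iff it races with the
        previous write to the same location."""
        race = False
        if loc in self.last_write:
            u, uvc = self.last_write[loc]
            # race unless the previous write happens-before this one
            if u != t and not all(a <= b for a, b in zip(uvc, self.vc[t])):
                race = True
        self.last_write[loc] = (t, list(self.vc[t]))
        self.vc[t][t] += 1
        return race
```

In the Vamos setup, events like `write` and `acquire` arrive on per-thread event sources; the arbiter orders them by the atomically fetched timestamps before feeding a checker of this shape.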

Our implementation (contained in our artifact [12]) consists of:

<sup>5</sup> This decision was entirely to reduce our development effort; a dynamic instrumentation source could be swapped in without other changes.

Fig. 6. Comparing running times of the three tools on all 391 benchmarks (left) and the correctness of their verdicts on the subset of 118 benchmarks for which it was possible to determine the ground truth (right). Race vs. no race indicates whether the tool found at least one data race, correct vs. wrong indicates whether that verdict matches the ground truth. For benchmarks with unknown ground truth, the three tools agreed on the existence of data races more than 99% of the time.


As such, we were able to use Vamos to build a reasonable best-effort data-race monitor with relatively little effort, providing evidence that our ease-of-use design goal was achieved. To evaluate its performance, we tested it on 391 SV-COMP [7] concurrency test cases supported by our implementation, and compared it to two state-of-the-art dynamic data race detection tools, ThreadSanitizer [52] and Helgrind [57]. Figure 6 shows that the resulting monitor in most cases caused less overhead than both ThreadSanitizer and Helgrind in terms of time while producing largely the same (correct) verdicts.

### 7 Related Work

As mentioned before, Vamos' design features a combination of ideas from works in run-time monitoring and related fields, which we review in this section.

Event Brokers/Event Recognition A large number of event broker systems with facilities for event recognition [18,58,55,26,1] already exist. These typically allow arbitrary event sources to connect and submit events, and arbitrarily many observers to subscribe to various event feeds. Mansouri-Samani and Sloman [44] outlined the features of such systems, including filtering and combining events, merging multiple monitoring traces into a global one, and using a database to store (parts of) traces and additional information for the longer term. Modern industrial implementations of this concept, like Apache Flink [1], are built for massively parallel stream processing in distributed systems, supporting arbitrary applications but providing no special abstractions for monitoring, in contrast to more run-time-monitoring-focused implementations like ReMinds [58]. Complex event recognition systems also sometimes provide capabilities for load shedding [59], of which autodrop buffers are the simplest version. Most event recognition systems provide more features than Vamos, but are also harder to set up for monitoring; in contrast, Vamos offers a simple specification language that is efficient and still flexible enough for many monitoring scenarios.

Stream Run-Time Verification LoLa [19,24], TeSSLa [41], and Striver [27] are stream run-time verification [8] systems that allow expressing a monitor as a series of mutually recursive data streams that compute their current values based on each other's values. This requires some global notion of time, which is not necessarily available in a heterogeneous setting, as the streams are updated with new values at time ticks and refer to values in other streams relative to the current tick. Stream run-time verification systems also do not commonly support handling variable numbers of event sources. Some systems allow for dynamically instantiating sub-monitors for parts of the event stream [3,6,49,24] in a technique called parametric trace slicing [15]. This is used for logically splitting the events on a given stream into separate streams, making them easier to reason about, and can sometimes be exploited for parallelizing the monitor's work. These additional streams are internal to the monitoring logic; in contrast, Vamos' ability to dynamically add new event sources affects the monitoring system's outside connections, while, internally, the arbiter still unifies the events coming in on all such connections into one global stream.
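To make the slicing idea concrete, the following sketch (illustrative only, not tied to any of the cited systems) splits a single global stream of (key, payload) events into per-key sub-streams that sub-monitors could process independently:

```python
from collections import defaultdict

def slice_trace(events):
    """Split one global event stream into per-parameter sub-streams.

    Each event is a (key, payload) pair; all events sharing a key form one
    logical slice that a sub-monitor could then process independently.
    """
    slices = defaultdict(list)
    for key, payload in events:
        slices[key].append(payload)
    return dict(slices)

# Events from two logical connections interleaved on one global stream.
trace = [("conn1", "open"), ("conn2", "open"), ("conn1", "read"),
         ("conn2", "close"), ("conn1", "close")]
sliced = slice_trace(trace)
```

Note that the slices here are purely internal; the distinguishing point above is that Vamos instead changes which outside connections feed the single global stream.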

Instrumentation The two key questions in instrumentation concern the technical side of how a monitor accesses a monitored system and the behavioral side of what effects these accesses can have. On the technical side, static instrumentation can be applied either to source code [39,30,36,37,40,34] or to compiled binaries [23,20], while dynamic instrumentation, like DynamoRIO, is applied to running programs [43,46,9]. Alternatively, monitored systems or the platforms they run on may already have specific interfaces for monitors, such as ptrace in the Linux kernel or DTrace [10,28,50]. Any of these can be used to create an instrumentation tool for Vamos.

On the behavioral side, Cassar et al. surveyed various forms of instrumentation between completely synchronous and offline [11]. Many of the systems surveyed [21,4,14,16] use a form of static instrumentation that can either do the necessary monitoring work while interrupting the program's current thread whenever an event is generated, or offer the alternative of using the interruption to export the necessary data to a log to be processed asynchronously or offline. A mixed form called Asynchronous Monitoring with Checkpoints allows stopping the monitored system at certain points to let the monitor catch up [25]. Our autodrop buffers instead trade precision for avoiding this kind of overhead. Aside from the survey, some systems (like TeSSLa [41]) incrementalize their default offline behavior to provide a monitor that may eventually significantly lag behind the monitored system.

Executing monitoring code or even just writing event data to a file or sending it over the network is costly in terms of overhead, even more so if multiple threads need to synchronize on the relevant code. Ha et al. proposed Cache-Friendly Asymmetric Buffering [29] to run low-overhead run-time analyses on multicore platforms. They only transfer 64-bit values, which suffices for some analyses, but not for general-purpose event data. Our adapted implementation thus has to do some extra work, but shares the idea of using a lock-free single-producer-single-consumer ring buffer for low overhead and high throughput.
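The core of a single-producer-single-consumer ring buffer can be sketched as follows. This is a simplified Python illustration of the logic only: the producer writes only `head` and the consumer writes only `tail`, so each index has a single writer; a real implementation (such as the C code underlying Vamos) additionally relies on atomic operations and memory barriers.

```python
class SPSCRing:
    """Single-producer-single-consumer ring buffer sketch.

    Each index has exactly one writer: `head` is advanced only by the
    producer, `tail` only by the consumer, so no lock is needed between
    them (illustrative only; real implementations use atomics/barriers).
    """

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0  # next write slot, advanced only by the producer
        self.tail = 0  # next read slot, advanced only by the consumer

    def push(self, item):
        if self.head - self.tail == self.capacity:
            return False  # buffer full: caller may retry or drop the event
        self.buf[self.head % self.capacity] = item
        self.head += 1
        return True

    def pop(self):
        if self.tail == self.head:
            return None  # buffer empty
        item = self.buf[self.tail % self.capacity]
        self.tail += 1
        return item

ring = SPSCRing(capacity=2)
accepted = [ring.push(x) for x in (1, 2, 3)]  # third push is rejected
```

Because neither side ever blocks on the other, throughput stays high and the monitored system's producer thread incurs only a few cheap operations per event.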

While we try to minimize it, we accept some overhead for instrumentation as given. Especially in real-time systems, some run-time monitoring solutions adjust the activation status of parts of the instrumentation according to some metrics of overhead, inserting hole events for phases when instrumentation is deactivated [5,31,2]. In contrast, the focus of load-shedding through autodrop buffers is on ensuring that the higher-level part of the monitor is working with reasonably up-to-date events while not forcing the monitored system to wait. For monitors that do not rely on extensive summarization of dropped events, the two approaches could easily be combined.
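The autodrop idea can be sketched conceptually as follows (this is an illustration of the principle, not Vamos' actual implementation): a bounded buffer that never blocks the producer, drops events when full, and inserts an explicit hole event recording how many were lost.

```python
class AutodropBuffer:
    """Conceptual autodrop buffer sketch (not Vamos' actual implementation).

    The producer never blocks: when the buffer is full, incoming events are
    dropped and counted, and as soon as there is room again a single
    ("hole", n) event records how many events were lost. The higher-level
    monitor thus sees reasonably recent events plus explicit gap markers
    instead of silently missing data.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []
        self.dropped = 0

    def push(self, event):
        pending = [("hole", self.dropped)] if self.dropped else []
        if len(self.events) + len(pending) >= self.capacity:
            self.dropped += 1  # drop instead of making the producer wait
            return
        self.events.extend(pending + [event])
        self.dropped = 0

    def pop(self):
        return self.events.pop(0) if self.events else None

buf = AutodropBuffer(capacity=2)
for e in (1, 2, 3):
    buf.push(e)  # the third event is dropped and counted
first, second = buf.pop(), buf.pop()
buf.push(4)  # a hole event is inserted before event 4
```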

Monitorability and Missing Events Monitorability [38,47] studies the ability of a run-time monitor to produce reliable verdicts about the monitored system. The possibility of missing arbitrary events on an event stream without knowing about it significantly reduces the number of monitorable properties [35]. The autodrop buffers of Vamos instead insert hole information, which some LTL [32], TeSSLa [42], and Mealy machine [54] specifications can be patched to handle automatically. Run-time verification with state estimation [53] uses a Hidden Markov Model to estimate the data lost in missing events.

### 8 Conclusion

We have presented Vamos, which we designed as middleware for best-effort third-party run-time monitoring. Its goal is to significantly simplify the instrumentation part of monitoring, broadly construed as the gathering of high-level observations that serve as the basis for traditional monitoring specifications. This is particularly relevant for best-effort third-party run-time monitoring, which may often need significant preprocessing of the gathered information, potentially collected from multiple heterogeneous sources. We have presented preliminary evidence that Vamos can handle large numbers of events and lets us specify a variety of monitors with relative ease. In future work, we plan to apply Vamos to more diverse application scenarios, such as multithreaded webservers processing many requests in parallel, or embedded software, and to integrate our tools with other higher-level languages. A verdict that a system's behavior conforms to expectations generally inspires more trust when the monitor was written by a third party than when it was written by the system's developers. We hope that our design can help make best-effort third-party run-time monitoring more common.

Acknowledgements This work was supported in part by the ERC-2020-AdG 101020093. The authors would like to thank the anonymous FASE reviewers for their valuable feedback and suggestions.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Yet Another Model! A Study on Model's Similarities for Defect and Code Smells

Geanderson Santos<sup>1</sup>, Amanda Santana<sup>1</sup>, Gustavo Vale<sup>2</sup>, and Eduardo Figueiredo<sup>1</sup>

<sup>1</sup> Federal University of Minas Gerais, Belo Horizonte, Brazil {geanderson,amandads,figueiredo}@dcc.ufmg.br <sup>2</sup> Saarland University, Saarbrücken, Germany vale@cs.uni-saarland.de

Abstract. Software defect and code smell prediction help developers identify problems in the code and fix them before they degrade the quality or the user experience. The prediction of software defects and code smells is challenging, since it involves many factors inherent to the development process. Many studies propose machine learning models for defects and code smells. However, we have not found studies that explore and compare these machine learning models, nor that focus on the explainability of the models. Such an analysis allows us to verify which features and quality attributes influence software defects and code smells. Hence, developers can use this information to predict whether a class may be faulty or smelly by evaluating a few features and quality attributes. In this study, we fill this gap by comparing machine learning models for predicting defects and seven code smells. We trained the models on a dataset composed of 19,024 classes and 70 software features, covering different quality attributes, extracted from 14 Java open-source projects. We then ensembled five machine learning models and employed explainability concepts to explore the redundancies in the models using the top-10 software features and quality attributes that contribute most to the defect and code smell predictions. We conclude that, although the quality attributes vary among the models, complexity, documentation, and size are the most relevant. More specifically, Nesting Level Else-If is the only software feature relevant to all models.

Keywords: Defect Prediction · Code Smells Detection · Explainable Machine Learning · Quality Attributes

# 1 Introduction

Software defects appear at different stages of the life-cycle of software systems, degrading the software quality and hurting the user experience [25]. Sometimes, the damage caused by software defects is irreversible [44]. As a consequence, the software cost increases as developers need time to fix defects [43]; it is therefore better to avoid them as much as possible. Several studies showed that the presence of code smells and anti-patterns is normally related to defective code [24,34,49,51]. Code smells are symptoms of implementation decisions that may degrade the code quality [22]. Anti-patterns are the misuse of solutions to recurring problems [9]. For instance, Khomh et al. (2012) found that classes classified as God Classes are more defect-prone than classes that are not smelly. In this paper, we refer to both code smells and anti-patterns as code smells.

One technique to mitigate the impact of defects and code smells is the application of strategies that anticipate problematic code [47], usually with the use of machine learning models that predict a defect or code smell [12,13,14,26,35,45,47,52,73]. Training and evaluating machine learning models is a hard task, since (i) it needs a large dataset to avoid overfitting; (ii) the process of obtaining the labels and features to serve as input is costly and requires the use of different supporting tools; (iii) setting up the environment for training and evaluating models is time-consuming and computationally expensive, even though some tools help to automate the process; and (iv) understanding the importance of the features and how they affect the model is complex [39].

With these difficulties in mind, our goal is to identify a set of features that developers can use to simplify the process of defect and code smell prediction. To simplify, we aim at reducing the number of features that need to be collected to predict or identify possible candidates for defects and code smells, through an analysis of model redundancies. To the best of our knowledge, no other studies have investigated similarities between defect and code smell models. Instead, most studies focus on proposing and assessing the performance of models that predict defects or code smells [27,35,41,44]. In this work, we fill this gap through an analysis of which features are redundant or different in models built for defects and for seven code smells. Moreover, we highlight which quality attributes are relevant to their prediction. This analysis is made possible by the SHAP technique, which determines the contribution of each feature to the prediction. As a result, using SHAP allows us to verify which features contributed the most to the prediction and whether those features had high or low values.

To achieve our goal, we use a subset of 14 open-source Java systems whose features and defects were annotated [15,16]. We then employ the Organic tool [48] to detect nine code smells, three of which we merged due to similar definitions. After merging the data, we train and evaluate an ensemble machine learning model composed of five algorithms for each of our targets, i.e., defects and code smells. After evaluating the performance of our ensemble, we apply the SHAP technique to identify which features are relevant for each model. Finally, we analyze the results in terms of: (i) which features are relevant for each model; (ii) which features contribute the most to two or more models, to identify redundancies in the models; and (iii) which quality attributes are important for defect and code smell prediction.

Our main findings are: (i) from the seven code smells evaluated, we identified that the models most similar to the Defect model are God Class, Refused Bequest, and Spaghetti Code; (ii) Nesting Level Else-If (NLE) and Comment Density (CD) are the most important features; (iii) most features have high values, except on Refused Bequest; (iv) we identified sets of features that are common to trios of problems, such as API Documentation (AD), which is important for Defects, God Class, and Refused Bequest; (v) documentation, complexity, and size are the quality attributes that contribute the most to the prediction of defects and code smells; (vi) the intersection of features between the defects and code smells ranges from 40% for Refused Bequest to 60% for God Class. We also contribute to the community by providing an extension of the previous defect dataset [15,16] through the addition of nine smells, available in our online appendix [64]. As a consequence of these analyses, we obtained a smaller set of features that contributes to the prediction of defects and code smells. Developers and researchers may train machine learning models with less effort using these findings, or they may use these features to identify possible candidates for introducing defects and code smells.

We organize the remainder of this work as follows. Section 2 describes the background of our work. Section 3 shows how we structured the methodology. Then, Section 4 presents the results of our evaluation comparing the defect model with the code smells. Section 5 discusses the main threats to validity of our investigation. Section 6 presents the related work our investigation is based on. Finally, Section 7 concludes this paper with remarks for further explorations about the subject.

### 2 Background

### 2.1 Defects

A software defect represents an error, failure, or bug [1] in a software project that harms the appearance, operation, functionality, or performance of the target software [25]. Defects may appear at different development stages [71] and may interrupt the development progress and increase the planned budget of software projects [43]. Furthermore, a software team may discover defects only after code release, generating a significant effort to tackle them in production [37]. To mitigate these defects, defect prediction may find the defective classes [42,43,73] before system testing and release. For instance, if a software team has limited resources for software inspection, a defect predictor may indicate which modules are most likely to be defective.

### 2.2 Code Smells

Brown et al. [9] proposed a catalog of anti-patterns: like design patterns, they are recurring solutions, but instead of providing reusable structure they negatively impact the source code. Later, Fowler [22] introduced code smells as symptoms of sub-optimal decisions in the software implementation that lead to code quality degradation. Since our defect dataset is class-level, we only consider the problems related to classes. In our work, we considered the following smells: Refused Bequest (RB), Brain Class (BC), Class Data Should be Private (CP), Complex Class (CC), Data Class (DC), God Class (GC), Lazy Class (LC), Spaghetti Code (SC), and Speculative Generality (SG). The definitions of the problems discussed in this paper are: God Class is a large class that has too many responsibilities and centralizes the module functionality [61]. Refused Bequest is a class that does not use its parent's behavior [22]. Spaghetti Code is a class that has methods with large, monolithic multistage process flows [9]. Due to space constraints, the definitions of all evaluated problems can be found in our replication package [64].

### 3 Study Design

#### 3.1 Research Questions

In this paper, we investigate the similarities and redundancies between the software features used to predict defects and code smells. We can use this information to simplify the prediction model or identify possible candidates for introducing defects or smells. We employed data preparation to find the software features for the defect and code smell prediction models. Therefore, our main objective is to examine the software features applied for both predictions. Our paper investigates the following research questions.

RQ1. Are the defect and class-level code smell models explainable?


### 3.2 Data

Predicting a defect or a code smell is a supervised learning problem that requires a dataset with the values of the independent and dependent variables for each sample. Many datasets have been proposed in the literature [13,31,44]; however, in this work, the selected dataset portrays a joined version of several resources publicly available in the literature [15,16,17,74]. In total, five data sources compose this dataset: PROMISE [65], Eclipse Bug Prediction [84], Bug Prediction Dataset [13], Bugcatchers Bug Dataset [24], and GitHub Bug Dataset [74]<sup>3</sup>. The dataset has classes from 34 open-source Java projects [77]. Furthermore, the data comprises 70 software features related to different aspects of the code. We can divide the features into seven quality attributes: documentation, coupling, cohesion, clone, size, complexity, and inheritance. We also highlight that the dataset is imbalanced: only around 20% of the classes have a defect, and the code smells each affect between 4% and 16.2% of the classes. For these reasons, the dataset has a wide range of software features that may promote interesting analyses of defects and code smells. Finally, the open-source data facilitates the collection of code smells.

<sup>3</sup> https://zenodo.org/record/3693686

Data Collection. The first step of our study is to collect the data about the code smells to merge with the defect data [15]. We applied the Organic tool [48] to detect the code smells. As all projects are available on GitHub, we manually cloned the source code matching the project version included in the dataset. Since most of the systems in the original dataset have fewer than 1000 classes (20 systems), we collected data only from those with more than 1000 classes (14 projects). We decided to focus on these projects because they represent 75% of the entire defect data and are readily available on GitHub. Additionally, we matched the names of the detected code smell instances to the class names present in our defect dataset. Hence, independently of whether a class had a smell or not, we only considered it if a match was found in both datasets (i.e., the one with the defects and the one with the code smells). If we could not find a match, we did not consider the class for further investigation. We used this approach to avoid bias, as it would be unfair to determine that a class Organic could not match in the defect dataset is non-smelly. Furthermore, this approach decreased the number of classes for most of the projects.
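The matching step above amounts to an inner join on class names; the sketch below illustrates it with made-up class and field names (the actual data layout is in the replication package):

```python
def merge_by_class_name(defect_data, smell_data):
    """Join two per-class datasets on class name (field names are made up).

    A class is kept only if it appears in both sources; classes missing from
    either side are excluded rather than being labeled non-smelly or
    non-defective, mirroring the matching step described above.
    """
    common = set(defect_data) & set(smell_data)
    return {name: {**defect_data[name], **smell_data[name]} for name in common}

defects = {"Foo": {"defective": True}, "Bar": {"defective": False}}
smells = {"Foo": {"god_class": False}, "Baz": {"god_class": True}}
merged = merge_by_class_name(defects, smells)  # only "Foo" is in both
```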



(Table 1 legend) CP: Class Data Should be Private; DC: Data Class; GC: God Class; LC: Lazy Class; RB: Refused Bequest; SC: Spaghetti Code; SG: Speculative Generality.

Organic collects a wide range of code smells, including method-level and class-level ones. However, as the defect dataset is class-level, we only use the code smells found in classes. For this reason, we obtained the ground truth of nine smells, as described in Section 2.2. After collecting the data, we merged three code smells, Brain Class (BC), God Class (GC), and Complex Class (CC), into one. Beyond their similar definitions, we merged BC and CC into GC due to their low occurrence in the dataset, and we named the merged smell God Class (GC), since this name is more used in the literature [66]. Consequently, we evaluate seven smells in total.

Table 1 shows a summary of the data for each project. The first column presents the project's name. The second column presents the project version included in the dataset. The third column shows the number of classes for each system. Columns 4 through 10 show the number of smells found. The last column presents the number of defects in the system. The Total row presents the absolute number of classes and smelly/defective classes. The Percentage row presents the percentage of classes affected by each smell/defect. We can observe from Table 1 that the projects vary in size: Lucene has the fewest classes (500), while Elasticsearch has the most (2605). We also observe that the number of smells and defects varies greatly for each system. For instance, the Xalan system has 456 instances of God Class and 947 defects. Meanwhile, even though Neo4J is a large system, it had only 18 defects, i.e., 1% of its classes are defective.

Code Smells Validation. To validate the code smells collected with Organic, we conducted a manual validation with developers. First, we selected three of the most frequent code smells (GC, RB, and SC), since manual validation is costly and developers have to first understand the code. Then, we elaborated questions about each code smell based on the current literature: God Class (GC) [66], Refused Bequest (RB) [36], and Spaghetti Code (SC) [9]. We then conducted a pilot study with four developers to improve the questions, using classes that Organic classified as one of the code smells. This allowed us to verify whether the questions were suitable for our goals and whether the surveyed developers understood them. For each instance in our sample, we asked nine questions (three for each smell). The developers were blind to which code smells they were evaluating and had four possible responses: "Yes", "No", "Don't Know", and "NA" (Not Applicable). The questions and developers' answers can be found in our replication package [64].

To make our validation robust, we calculated the sample size based on the number of instances for each of the three smells in our dataset. We set a confidence level of 90% and a margin of error of 10%. As a result, the sample size should have at least eighteen classes of each target code smell. Furthermore, to avoid biasing the analysis, we determined that two developers should evaluate each instance in our sample. In this case, developers had to validate 108 software classes (54 unique). To validate the 108 software classes, we invited fifteen developers from different backgrounds, including two co-authors. One of the authors was the moderator of the analysis and did not participate in the validation. As there were three questions for each smell, in order to consider an instance as truly containing the smell, developers needed to agree with the expected answer that supports the presence of the code smell on two out of three questions. In addition, if the two developers that evaluated the same instance disagreed on the presence of the smell, a third, more experienced developer checked the instance to make the final decision. These tiebreaker evaluations were done by two software specialists who did not participate in the previous validation.
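The paper does not state which exact formula was used for the sample size; a standard proportion-based calculation with a finite-population correction, which yields numbers of this magnitude, can be sketched as follows (z = 1.645 corresponds to a 90% confidence level):

```python
import math

def sample_size(population, z=1.645, margin=0.10, p=0.5):
    """Required sample size for estimating a proportion at a given
    confidence level (z = 1.645 ~ 90%) and margin of error, with a
    finite-population correction. Illustrative assumption, not
    necessarily the authors' exact procedure."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))
```

For large populations this yields about 68 samples; for the small per-smell populations in the dataset, the finite-population correction shrinks the requirement toward the eighteen classes mentioned above.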

In the end, the developers agreed that all GC instances classified by the tool were correct (i.e., 18 out of 18 responses). For RB, the developers agreed on 14 out of the 18 software classes (approximately 78% agreement with the tool). Finally, SC fared slightly worse: the developers classified 13 out of the 18 classes as SC, an agreement of 72% between the developers and the tool. These results demonstrate that Organic can identify code smells with an appropriate level of accuracy (around 83% overall agreement). For this reason, we conclude that the Organic data is adequate to represent code smells.

### 3.3 Quality Attributes

Although the literature proposes many quality attributes to group software features [4,8,68], we focus on the quality attributes previously discussed in the selected dataset [15,16]. These quality attributes cluster the entire collection of software features. Therefore, we separate the aforementioned software features into seven quality attributes: (i) Complexity, (ii) Coupling, (iii) Size, (iv) Documentation, (v) Clone, (vi) Inheritance, and (vii) Cohesion. Table 2 presents the quality attributes with their definition and reference. The complete list of software features (66 in total) and the quality attributes are available in the replication package of this study [64].



#### 3.4 Machine Learning

The predictive accuracy of machine learning classification models depends on the association between the structural software properties and a binary outcome. In this case, the properties are the software features widely evaluated in the literature [15,16], and the binary outcome is the prediction of whether the class is defective or non-defective, or whether the class presents each of the evaluated code smells. To compare the defect and code smell prediction models, we rely on the same set of software features, i.e., the models are trained with the same 66 measures, except for the target representing the presence/absence of a defect/code smell. We train one machine learning model for each target (i.e., defect and each code smell). To build these models, we employ a tool known as PyCaret [6] to assist in the different parts of the process, described later. Finally, to test the capacity of the models, we apply five evaluation metrics: accuracy, recall, precision, F1, and AUC [11].
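Four of these metrics can be computed directly from the confusion matrix of binary predictions; the sketch below shows how (AUC is omitted, since it requires predicted probabilities rather than hard 0/1 labels):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels.

    AUC is not computed here because it needs predicted probabilities,
    not hard 0/1 predictions.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```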

Data Preparation. To build our models, we follow the fundamental steps described in Figure 1. The three rounded rectangles indicate the steps and the actions we performed to build the models. First, we clean the data (i). Then, we explore the data to identify how best to represent it for our models (ii). Finally, we prepare the features to avoid overfitting (iii).

Fig. 1. Data Preparation Process Overview.

Data Cleaning. We first applied data cleaning to eliminate duplicated classes, non-numeric data, and missing values [56]. Hence, it was possible to vertically reduce the data by removing a small number of repeated entries (61 classes). Further, we reduced the horizontal dimension of the data from 70 to 65 features by eliminating the non-numeric features. We also removed four overrepresented software features, which gathered information about the exact line and column at which a class started and ended. In the end, we executed data imputation to handle missing values, but the dataset had none.
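These cleaning steps can be illustrated with a small sketch (field names such as `line_start` are hypothetical; the real feature names are in the replication package):

```python
def clean(rows, drop_features=("line_start", "line_end", "col_start", "col_end")):
    """Sketch of the cleaning steps described above: drop duplicated entries,
    non-numeric (string) features, and the position features (hypothetical
    field names), then count the remaining missing values."""
    cleaned, seen = [], set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # duplicated entry
        seen.add(key)
        cleaned.append({k: v for k, v in row.items()
                        if k not in drop_features and not isinstance(v, str)})
    missing = sum(1 for row in cleaned for v in row.values() if v is None)
    return cleaned, missing

rows = [
    {"loc": 100, "pkg": "a.b", "line_start": 5, "cc": 7},
    {"loc": 100, "pkg": "a.b", "line_start": 5, "cc": 7},  # duplicate
    {"loc": 40, "pkg": "c.d", "line_start": 9, "cc": None},
]
cleaned, missing = clean(rows)
```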


Training the Models. To build our classifier, we employ a technique known as ensemble machine learning [6]. This technique learns how to best combine the predictions from multiple machine learning models; thus, we obtain a model with stronger predictive power, since it combines the prediction power of multiple models. To train the models, we divided the dataset into two sets: 70% of the data is used for training the models and 30% for testing them. To assess the performance of our models, we employed k-fold cross-validation. This technique splits the data into K partitions; in our work, we used K=10 [11], and at each iteration we use nine folds for training and the remaining fold for validation, permuting these partitions on each iteration. As a result, each fold serves as validation data exactly once. This method allows us to compare distinct models and helps us avoid overfitting, as the training set varies on each iteration.
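The k-fold scheme can be sketched as follows; fold assignment here is a simple round-robin over sample indices, whereas PyCaret's implementation may differ (e.g., stratified splits):

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Each fold serves exactly once as the validation set, while the
    remaining k-1 folds form the training set.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

splits = list(k_fold_indices(20, k=10))
```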

To identify which models are suitable for our goal, we evaluated 15 machine learning algorithms: CatBoost Classifier [6], Random Forest [23], Decision Tree [16], Extra Trees [6], Logistic Regression [29], K-Neighbors Classifier (KNN) [80], Gradient Boosting Machine [83], Extreme Gradient Boosting [63], Linear Discriminant Analysis [6], Ada Boost Classifier [55], Light Gradient Boosting Machine (LightGBM) [32], Naive Bayes [75], Dummy Classifier [55], Quadratic Discriminant Analysis [6], and Support Vector Machines (SVM) [23]. Furthermore, to tune the hyper-parameters of each model, we apply Optuna [5], which uses Bayesian optimization to find the best hyper-parameters for each model. After experimenting with all the targets, we observed that five models are capable of achieving good performance independently of the target (i.e., defects or code smells): Random Forest [23], LightGBM [32], Extra Trees [10], Gradient Boosting Machine [72], and KNN [80]. For this reason, these models are carried forward into the ensemble model. The data on the performance of the evaluated models can be found in our replication package [64]. To evaluate our models, we focus on the F1 and AUC metrics. F1 represents the harmonic mean of precision and recall. Additionally, AUC is relevant because we are dealing with binary classification and this metric shows the performance of a model across all thresholds. For these reasons, both metrics are suitable for the imbalanced nature of the data [11].

Explaining the Models. The current literature offers many possibilities to explain machine learning models for a variety of problems. One of the most prominent techniques is the application of SHAP (SHapley Additive exPlanations) values [39]. These values compute the importance of each feature in the prediction model; therefore, we can reason about why a machine learning model made a given decision in the specific domain. SHAP is appropriate because machine learning models are hard to explain [69] and features interact in complex patterns to create models that provide more accurate predictions. Consequently, knowing the logic behind the prediction for a software class is a determinant factor that can help tackle the reasons behind a defect or code smell in the target class.
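SHAP values are efficient, model-specific approximations of Shapley values from cooperative game theory. To make the underlying idea concrete, the brute-force computation below enumerates all feature coalitions for a toy additive model (the feature names NLE and CD are borrowed from the paper; the contribution values are made up):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values for a toy model by enumerating all coalitions.

    `value(subset)` is the model output when only the features in `subset`
    are present; each feature's attribution is its weighted average marginal
    contribution over all coalitions. SHAP computes such attributions
    efficiently for real models; this brute-force version only illustrates
    the underlying idea.
    """
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        rest = [g for g in features if g != f]
        for r in range(len(rest) + 1):
            for subset in combinations(rest, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                phi[f] += weight * (with_f - without_f)
    return phi

# Toy additive model (contribution values are made up): for additive
# models, Shapley values recover each feature's own contribution exactly.
contrib = {"NLE": 0.4, "CD": 0.1}
phi = shapley_values(list(contrib), lambda s: sum(contrib[f] for f in s))
```

A key property visible here is local accuracy: the attributions sum to the model's prediction, which is what makes per-feature contributions interpretable.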

### 4 Results

### 4.1 Predictive Capacity

Before explaining the models, we evaluate whether they can effectively predict the code smells and defects. Even though we originally built models for the entire set of code smells, we observed that only three code smells (God Class, Refused Bequest, and Spaghetti Code) have models comparable to the defect model. For this reason, we only present the results of these three code smells. We believe some code smells are not similar to the defect model because they indicate simple code with less chance of having a defect, for instance, Lazy Class and Data Class. As a result, these code smells seem not to have similarities with the defects. The results for the remaining code smells are available in the replication package [64].


Table 3. Performance of the Machine Learning Models.

Table 3 shows the performance of each ensemble machine learning model for our four targets (i.e., defects and the three code smells). The values in the columns represent the mean of the 10-fold cross-validation. We present in each column the performance for the five evaluation metrics. We can observe from Table 3 that the performance of the ensemble model for the four targets is fairly acceptable, with models presenting an F1 score ranging from approximately 65% (defect model) to 82% (God Class model). These numbers are similar to other studies with similar purposes [15,16]. We conclude that the models can predict the targets with acceptable accuracy, as shown by the high AUC values in Table 3. For this reason, we may exploit these machine learning models to explain their predictions using the SHAP technique. In doing so, we can reason about the similarities of the software features associated with defects and code smells.

RQ1. The results show that the predictive accuracy of the defect and code smell models can be used to compare the models in terms of their features, with good F1 measures and high AUC. We also found that the class-level code smell models are slightly superior to the defect model in all five evaluation metrics.

### 4.2 Explaining the Models

This section discusses the explanation of each target model. We rely on SHAP to support the model explanation [39]. To simplify our analysis, we consider the top-10 most influential software features on the target in each prediction model. We then compare each code smell model with the defective one. Our goal is to find similarities and redundancies between the software features that help the machine learning model to predict the target code smells and defects. We extract these ten software features from each of the four target models (i.e., the defect model and the three code smell models presented in this paper).
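The selection of the top-10 features and the comparison between models can be sketched as follows: rank features by their mean absolute SHAP value over all instances, then intersect the resulting sets, as done for the Venn diagrams. The feature names match the paper's metrics, but the per-instance SHAP values below are made up for illustration (in practice they would come from a library such as `shap`):

```python
def top_k_features(shap_values, feature_names, k=10):
    """Rank features by mean absolute SHAP value across instances."""
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / len(shap_values)
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(importance.items(), key=lambda kv: -kv[1])
    return [name for name, _ in ranked][:k]

# Hypothetical per-instance SHAP values for three metrics.
names = ["NLE", "CD", "CBO"]
defect_shap = [[0.4, 0.1, 0.02], [-0.5, 0.2, 0.01]]
smell_shap = [[0.3, 0.05, 0.4], [0.2, -0.1, 0.5]]

top_defect = set(top_k_features(defect_shap, names, k=2))
top_smell = set(top_k_features(smell_shap, names, k=2))
print(top_defect & top_smell)  # features shared by both models
```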

To illustrate our results, we employ Venn diagrams to check the intersection of features between the four models (Figures 2, 3, and 4). Each Venn diagram displays two dashed circles, one for the code smell model and another for the defect model. Inside each dashed circle, we present as inner circles the top-10 software features that contributed the most to the prediction of the target. The color of an inner circle represents the feature's quality attribute. Likewise, the size of an inner circle represents the influence of the feature on the model: the bigger the size, the more the feature contributes to the target prediction. Next to each inner circle, an arrow indicates the direction of the feature value. For instance, a software feature with an arrow pointing up contributes to the prediction when its value is high, whereas a feature with an arrow pointing down contributes to the prediction when its value is low. The software features in the intersection have two inner circles because they may have a different impact on each of the two targets. For a better understanding of the acronyms, we show on the right side of each diagram a table with the acronym and full name of every feature that appears in the diagram.

God Class. Figure 2 shows the top-10 features that contribute to the Defect and God Class models, and their feature intersection. We can observe from Figure 2 that the defect model has an intersection with God Class of 6 out of 10 features. This means that 60% of the top-10 features that contribute the most to predictions are the same for both models. These features are CD, CLOC, AD, NL, NLE, and CLLC; most of them are related to documentation (3 out of 6) and complexity (2 out of 6). The only difference is for the CD, which needs to have low values to help in the God Class prediction. All remaining software features require a high value to predict a defect or a God Class (see arrows pointing up). Moreover, in terms of importance, for both models, the largest inner circles are for NLE, NL, and AD. The importance of AD is smaller for the God Class model than for the defect model. Meanwhile, the importance of NLE is a bit larger for God Class than for the defect model. For the NL feature, the importance was equivalent in both models.

Fig. 2. Top-10 Software Features for the Defect and God Class Models.

Refused Bequest. Figure 3 shows the top-10 features that contribute the most to the Defect and Refused Bequest models. We can observe from the Venn diagram in Figure 3 that the defect model has an intersection of 40% (4 out of 10 features) with the Refused Bequest model when considering their top-10 software features. The features that intersect are CD, AD, NLE, and DIT. It is interesting to notice that for 3 out of the 4 software features in the intersection, the values that help to detect the Refused Bequest have to be low (see arrows pointing down), while for the defect model, all of them require high values. Furthermore, most of the Refused Bequest features (6 out of 10, or 60%) have to be low. In terms of importance, the DIT and NLE features were similar for both models. However, the contribution of both CD and AD to the Refused Bequest model was smaller. Additionally, two features that highly contributed to the Refused Bequest model are not in the intersection (NOP and NOA), while one (NL) is outside the intersection for the defect model. We also note that three features are related to the inheritance quality attribute, but only one, DIT, intersects for both models. We also observe that size is relevant for both models; however, no size feature lies in the intersection of the models. The cohesion aspect was important only for the Refused Bequest model. Of the documentation attribute, which is relevant for the defect model (4 out of 10 features), two features have small importance (CLOC and PDA). The complexity attribute, indicated by NLE, is also very relevant for both models. CBO is the only coupling metric in the Refused Bequest model.

Fig. 3. Top-10 Software Features for the Defect and Refused Bequest Models.

Spaghetti Code. Figure 4 presents the 10 features that are most important to the Defect and Spaghetti Code models. We observe in Figure 4 that the Spaghetti Code model has a 50% intersection with the defect model. They intersect on the CD, CLOC, CLLC, NL, and NLE features. For both models, most features need high values, with one exception for Spaghetti Code, the CD. The features NL, NLE, and CLOC had similar importance. On the other hand, the CD feature contributes less to the Spaghetti Code model, while the CLLC feature contributes less to the defect model than to the Spaghetti Code model. It is interesting to notice that most features that highly contribute to the Spaghetti Code prediction are outside the intersection (NOI, TNOS, and CBO). Furthermore, the complexity quality attribute intersects both models (i.e., 2 out of 5). In addition, two of the documentation features of the defect model are important for the Spaghetti Code model. The clone duplication attribute also has half of its Spaghetti Code features in the intersection (CLLC). Size is relevant for both models, but none of its features intersects (2 out of 10 for both models): the features TLOC and NLG appear in the defect model, while TNOS and TNLA appear in the Spaghetti Code model. Coupling is exclusive to the Spaghetti Code model, while inheritance is exclusive to the defect model.

After observing the three figures (Figures 2, 3, and 4), we notice some intersections between the four models. For instance, CLOC is important for Defect, God Class, and Spaghetti Code models, even though the importance for God

Fig. 4. Top-10 Software Features for the Defect and Spaghetti Code Models.

Class was smaller (see inner circle sizes). For this trio, NL and CLLC are also important, although CLLC contributes little compared to other features. For the Defect, God Class, and Refused Bequest models, we highlight that the AD feature has high importance for all three. Meanwhile, there are also intersections between the smell models. For the God Class and Spaghetti Code pair, we note that both NOI and TNOS are highly relevant. Finally, CBO is important for the God Class, Refused Bequest, and Spaghetti Code models, but with moderate importance.

RQ2. There is a group of software features that intersect between the defect models and the three code smells. More importantly, Nesting Level Else-If (NLE) and Comment density (CD) appear in the four models, although the CD influence is considerably low for the evaluated code smells. Furthermore, CBO is important for all the code smells, but not the defect model.

Figure 5 presents the number of features that correspond to the evaluated quality attributes according to the top-10 features discovered by SHAP. We stack each quality attribute horizontally to facilitate the comparison between them. Hence, our results indicate that practitioners do not need to concentrate on all software features to predict defects and the investigated code smells. A subset of features is enough to predict the targets. For instance, software features related to the documentation are the most relevant for the Defect and God Class models, with 4 and 3 features on the top-10, respectively. The Refused Bequest model needs software features related to the inheritance (3 features), but size and documentation are also relevant with two features each. Meanwhile, the Spaghetti Code model is the most comprehensive, requiring features linked to documentation, size, complexity, coupling, and clone duplication, with all of them having two features.

Based on the results discussed, we conclude that the four ensemble machine learning models require at least one software feature related to documentation (CD) and complexity (NLE) to predict the target. Hence, future studies about

Fig. 5. Comparison between the Top-10 Features of each Target.

defect and code smell prediction, independently of the dataset and domain, could focus on these two feature collections. Furthermore, as we can observe in Figure 5, considering all the machine learning models evaluated, documentation, complexity, and size are the most important quality attributes contributing to the detection of defects and code smells.

RQ3. The most relevant quality attributes to predict defects and code smells vary greatly between each model. For instance, documentation is more important for the Defect and God Class models, while Spaghetti Code has all of its five quality attributes with the same importance, and Refused Bequest prioritizes the inheritance. In general, documentation, complexity, and size contribute more to the prediction of defects and the investigated code smells.

# 5 Threats to Validity


the defects and code smell. In this case, we limit the scope to the Java programming language to make our analysis feasible. However, we selected relevant systems that vary in domains, maturity, and development practices. For this reason, we cannot guarantee that our results generalize to other programming languages.


### 6 Related Work

Defect Prediction. Several studies [42,75] apply code metrics to defect prediction. They vary in terms of accuracy, complexity, target programming language, input prediction density, and machine learning models. Menzies et al. [42] presented defect classifiers using code attributes defined by McCabe and Halstead metrics. They concluded that the choice of the learning method is more important than which subset of the available data is used for learning the software defects. In a similar approach, Turhan et al. [75] used cross-company data for building localized defect predictors. They applied principles of analogy-based learning to cross-company data to fine-tune these models for localization, using static code features extracted from the source code, such as complexity features and Halstead metrics. They concluded that cross-company data are useful in extreme cases and when within-company data are not available [75].

In the same direction, the study of Turhan et al. [76] evaluates the effect of mixing data from different project stages. In this case, the authors use within- and cross-project data to improve the prediction performance. They show that mixing project data from the same project stage does not significantly improve the model performance. Hence, they concluded that finding optimal data for defect prediction is still an open challenge for researchers [76]. Similarly, He et al. [27] investigate defect prediction based on data selection. The authors propose a brute-force approach to select the most relevant data for learning the software defects. To do so, they perform three large-scale experiments on 34 datasets obtained from ten open-source projects. They conclude that training data from the same project does not always help to improve the prediction performance [27]. In contrast, we base our investigation on ensemble learning to improve the prediction performance and on a wide set of software features.

Code Smells Prediction. Several automated detection strategies for code smells and anti-patterns have been proposed in the literature [18]. These approaches rely on diverse identification strategies. For instance, some methods are based on combinations of metrics [48,57]; refactoring opportunities [19]; textual information [54]; historical data [52]; and machine learning techniques [7,12,14,20,21,35,40,41]. Khomh et al. [35] used Bayesian Belief Networks to detect three anti-patterns. They trained the models using two Java open-source systems. Maiga et al. [41] investigated the performance of a Support Vector Machine trained on three systems to predict four anti-patterns. Later, the authors introduced a feedback system to their model [40]. Amorim et al. [7] investigated the performance of Decision Trees to detect four code smells in one version of the Gantt project. Differently from these works, our dataset is composed of 14 systems, and we evaluate 9 code smells at the class level.

Cruz et al. [12] evaluated seven models to detect four code smells in 20 systems. The authors found that algorithms based on trees had a better F1 score than other models. Fontana et al. [20] evaluated six models to predict four smells. However, they used the severity of the smells as the target. They reported high performance numbers for the evaluated models. Later, Di Nucci et al. [14] replicated this study [20] to address several limitations that potentially biased the models' performance. The authors found that the models' performance, when compared to the reference study, was 90% lower, indicating the need to further explore how to improve code smell prediction. Differently from previous work on code smell prediction, we are interested in exploring the similarities and differences between models for predicting code smells and models for defect prediction.

Defects and Code Smells. Several works have tried to understand how code smells affect software, evaluating different quality aspects, such as maintainability [21,67,82], modularity [62], program comprehension [2], change-proneness [33,34], and how developers perceive code smells [53,81]. Other studies evaluate how code smells impact defect proneness [24,28,34,49,50,51]. Olbrich et al. [49] evaluated the fault-proneness evolution of the God Class and Brain Class smells in three open-source systems. They discovered that classes with these two smells can be more fault-prone; however, this did not hold for all analyzed systems. Similarly, Khomh et al. [34] evaluated the impact on fault-proneness of 13 different smells in several versions of three large open-source systems. They report the existence of a relationship between some code smells and defects, but it is not consistent across all system versions. Openja et al. [50] evaluated how code smells can make classes more fault-prone in quantum projects. Differently from these studies, we aim to understand whether models built for defects and code smells are similar or not.

Hall et al. [24] investigated whether files with smells present more defects than files without them. They found that for most of these smells, there is no statistical difference between smelly and non-smelly classes. Palomba et al. [51] evaluated how 13 code smells affect the presence of defects using a dataset of 30 open-source Java systems. They reported that classes with smells have more bug fixes than classes without any smells. Jebnoun et al. [28] evaluated how Code Clones are related to defects in three different programming languages. They concluded that smelly classes are more defect-prone, but that this varies according to the programming language. Differently from these three studies, we aim to understand how the prediction of defects differs from the models used to detect code smells, rather than establishing a correlation between defects and code smells.

Explainable Machine Learning for Software Features. Software defect explainability is a relatively recent topic in the literature [30,46,58]. Mori and Uchihira [46] analyzed the trade-off between accuracy and interpretability of various models, identifying balanced configurations that satisfy both accuracy and interpretability criteria. Likewise, Jiarpakdee et al. [30] empirically evaluated two model-agnostic procedures, Local Interpretable Model-agnostic Explanations (LIME) [60] and the BreakDown technique. They improved the results obtained with LIME using hyperparameter optimization, which they called LIME-HPO. This work concludes that model-agnostic methods are necessary to explain individual predictions of defect models. Finally, Pornprasit et al. [58] proposed a tool that predicts defects for systems developed in Python. The input data consists of software commits, and the authors compare its performance with LIME-HPO [30]. They conclude that the results are comparable to state-of-the-art techniques for explaining models.

### 7 Conclusion

In this work, we investigated the relationship between defect and code smell machine learning models. To do so, we identified and validated the code smells collected with Organic. Then, we applied an extensive data processing step to clean the data and select the most relevant features for the prediction models. Subsequently, we trained and evaluated the models using an ensemble of models. Finally, as the models performed well, we employed an explainability technique known as SHAP to understand the models' decisions. We concluded that among the seven code smells initially collected, only three (Refused Bequest, God Class, and Spaghetti Code) were similar to the defect model. In addition, we found that the features Nesting Level Else-If and Comment Density were relevant for all four models. Furthermore, most features require high values to predict defects and code smells, except for the Refused Bequest model. Finally, we reported that the documentation, complexity, and size quality attributes are the most relevant for these models. In future steps of this investigation, we can compare the SHAP results with other techniques (e.g., LIME) and employ white-box models to simplify the explainability. Another potential application of our study is the comparison of the reported code smells with those of other tools. We encourage the community to further investigate and replicate our results. For this reason, we made all data available in the replication package [64].

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Competition Contributions**

# Software Testing: 5th Comparative Evaluation: Test-Comp 2023

Dirk Beyer(B)

LMU Munich, Munich, Germany

Abstract. The 5th edition of the Competition on Software Testing (Test-Comp 2023) provides again an overview and comparative evaluation of automatic test-suite generators for C programs. The experiment was performed on a benchmark set of 4 106 test-generation tasks for C programs. Each test-generation task consisted of a program and a test specification (error coverage, branch coverage). There were 13 participating test-suite generators from 6 countries in Test-Comp 2023.

Keywords: Software Testing · Test-Case Generation · Competition · Program Analysis · Software Validation · Software Bugs · Test Validation · Test-Comp · Benchmarking · Test Coverage · Bug Finding · Test Suites · SV-Benchmarks · BenchExec · TestCov · CoVeriTeam

# 1 Introduction

In its 5th edition, the International Competition on Software Testing (Test-Comp, https://test-comp.sosy-lab.org, [7,8,9,10,11]) again compares automatic test-suite generators for C programs, in order to showcase the state of the art in the area of automatic software testing. This competition report is an update of the previous reports: it refers to the established rules and definitions, presents the competition results, and gives some interesting data about the execution of the competition experiments. We use BenchExec [24] to execute the benchmarks; the results are presented in tables and graphs on the competition web site (https://test-comp.sosy-lab.org/2023/results) and are available in the accompanying archives (see Table 3).

Competition Goals. In summary, the goals of Test-Comp are the following [8]:

• Establish standards for software test generation. This means, most prominently, to develop a standard for marking input values in programs, define an exchange format for test suites, agree on a specification language for test-coverage criteria, and define how to validate the resulting test suites.

This report extends previous reports on Test-Comp [7,8,9,10,11].

Reproduction packages are available on Zenodo (see Table 3).

<sup>(</sup>B) dirk.beyer@sosy-lab.org


Related Competitions. In the field of formal methods, competitions are respected as an important evaluation method and there are many competitions [5]. We refer to the report from Test-Comp 2020 [8] for a more detailed discussion and give here only the references to the most related competitions [5,13,46,48].

# 2 Definitions, Formats, and Rules

Organizational aspects such as the classification (automatic, off-site, reproducible, jury, training) and the competition schedule are given in the initial competition definition [7]. In the following, we repeat some important definitions that are necessary to understand the results.

Test-Generation Task. A test-generation task is a pair of an input program (program under test) and a test specification. A test-generation run is a non-interactive execution of a test generator on a single test-generation task, in order to generate a test suite according to the test specification. A test suite is a sequence of test cases, given as a directory of files according to the format for exchangeable test-suites.<sup>1</sup>
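The exchangeable test-suite format referenced above stores a test suite as a directory with one file per test case, each listing the input values the program consumes in order. A rough, illustrative sketch of a single test-case file is shown below; the element names are an assumption based on the test-format repository linked in the footnote, which should be consulted for the authoritative schema:

```xml
<!-- testcase-1.xml (illustrative sketch): the two <input> values are
     fed, in order, to the program's successive input calls -->
<testcase>
  <input>42</input>
  <input>-7</input>
</testcase>
```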

Execution of a Test Generator. Figure 1 illustrates the process of executing one test-suite generator on the benchmark suite. One test run for a test-suite generator gets as input (i) a program from the benchmark suite and (ii) a test specification (cover bug, or cover branches), and returns as output a test suite (i.e., a set of test cases). The test generator is contributed by a competition participant as a software archive in ZIP format. The test runs are executed centrally by the competition organizer. The test-suite validator takes as input the test suite from

<sup>1</sup> https://gitlab.com/sosy-lab/software/test-format

Fig. 1: Flow of the Test-Comp execution for one test generator (taken from [8])

Table 1: Coverage specifications used in Test-Comp 2023 (similar to 2019–2022)


the test generator and validates it by executing the program on all test cases: for bug finding it checks if the bug is exposed and for coverage it reports the coverage. We use the tool TestCov [23] <sup>2</sup> as test-suite validator.

Test Specification. The specification for testing a program is given to the test generator as input file (either properties/coverage-error-call.prp or properties/coverage-branches.prp for Test-Comp 2023).

The definition init(main()) is used to define the initial states of the program under test by a call of function main (with no parameters). The definition FQL(f) specifies that coverage definition f should be achieved. The FQL (FShell query language [36]) coverage definition COVER EDGES(@DECISIONEDGE) means that all branches should be covered (typically used to obtain a standard test suite for quality assurance) and COVER EDGES(@CALL(foo)) means that a call (at least one) to function foo should be covered (typically used for bug finding). A complete specification looks like: COVER(init(main()), FQL(COVER EDGES(@DECISIONEDGE))).
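Putting these definitions together, the contents of the two specification files can be sketched as follows; the first line corresponds to properties/coverage-branches.prp and the second to properties/coverage-error-call.prp. The error-function name reach_error is an assumption based on recent benchmark conventions and is not fixed by the text above:

```
COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) )
COVER( init(main()), FQL(COVER EDGES(@CALL(reach_error))) )
```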

Table 1 lists the two FQL formulas that are used in test specifications of Test-Comp 2023; there was no change from 2020 (except that special function \_\_VERIFIER\_error does not exist anymore).

Task-Definition Format 2.0. Test-Comp 2023 used again the task-definition format in version 2.0.

<sup>2</sup> https://gitlab.com/sosy-lab/software/test-suite-validator

License and Qualification. The license of each participating test generator must allow its free use for reproduction of the competition results. Details on qualification criteria can be found in the competition report of Test-Comp 2019 [9].

# 3 Categories and Scoring Schema

Benchmark Programs. The input programs were taken from the largest and most diverse open-source repository of software-verification and test-generation tasks <sup>3</sup> , which is also used by SV-COMP [13]. As in 2020 and 2021, we selected all programs for which the following properties were satisfied (see issue on GitLab <sup>4</sup> and report [9]):


This selection yielded a total of 4 106 test-generation tasks, namely 1 173 tasks for category Error Coverage and 2 933 tasks for category Code Coverage. The test-generation tasks are partitioned into categories, which are listed in Tables 6 and 7 and described in detail on the competition web site.<sup>6</sup> Figure 2 illustrates the category composition.

Category Error-Coverage. The first category is to show the abilities to discover bugs. The benchmark set consists of programs that contain a bug. We produce for every tool and every test-generation task one of the following scores: 1 point, if the validator succeeds in executing the program under test on a generated test case that exposes the bug (i.e., the specified function was called), and 0 points otherwise.

Category Branch-Coverage. The second category is to cover as many branches of the program as possible. The coverage criterion was chosen because many test generators support this standard criterion by default. Other coverage criteria can be reduced to branch coverage by transformation [35]. We produce for every tool and every test-generation task the coverage of branches of the program (as reported by TestCov [23]; a value between 0 and 1) that are executed for the generated test cases. The score is the returned coverage.
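The scoring rules of the two categories can be sketched as follows; the function and field names are illustrative, not part of the competition infrastructure:

```python
def task_score(spec, validator_result):
    """Score one test-generation task from the validator's output,
    following the scoring rules described above."""
    if spec == "error":
        # Error Coverage: 1 point iff some generated test case
        # exposes the bug, 0 points otherwise.
        return 1 if validator_result["bug_exposed"] else 0
    if spec == "branches":
        # Branch Coverage: the score equals the branch coverage
        # reported by the validator (a value between 0 and 1).
        return validator_result["coverage"]
    raise ValueError(f"unknown specification: {spec}")

# Hypothetical results for four tasks of one test generator.
results = [
    ("error",    {"bug_exposed": True}),
    ("error",    {"bug_exposed": False}),
    ("branches", {"coverage": 0.75}),
    ("branches", {"coverage": 0.5}),
]
total = sum(task_score(spec, res) for spec, res in results)
print(total)  # 2.25
```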

Ranking. The ranking was decided based on the sum of points (normalized for meta categories). In case of a tie, the ranking was decided based on the run time, which is the total CPU time over all test-generation tasks. Opt-out from categories was possible and scores for categories were normalized based on the number of tasks per category (see competition report of SV-COMP 2013 [6], page 597).

<sup>3</sup> https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks

<sup>4</sup> https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks/-/merge\_requests/774

<sup>5</sup> https://test-comp.sosy-lab.org/2023/rules.php

<sup>6</sup> https://test-comp.sosy-lab.org/2023/benchmarks.php

Fig. 2: Category structure for Test-Comp 2023; compared to Test-Comp 2022, sub-category Hardware was added to main category Cover-Error

# 4 Reproducibility

We followed the same competition workflow that was described in detail in the previous competition report (see Sect. 4, [10]). All major components that were used for the competition were made available in public version-control

Fig. 3: Benchmarking components of Test-Comp and competition's execution flow (same as for Test-Comp 2020)



Table 3: Artifacts published for Test-Comp 2023


repositories. An overview of the components that contribute to the reproducible setup of Test-Comp is provided in Fig. 3, and the details are given in Table 2. We refer to the report of Test-Comp 2019 [9] for a thorough description of all components of the Test-Comp organization and how we ensure that all parts are publicly available for maximal reproducibility.

In order to guarantee long-term availability and immutability of the test-generation tasks, the produced competition results, and the produced test suites, we also packaged the material and published it at Zenodo (see Table 3).

The competition used CoVeriTeam [20] <sup>7</sup> again to provide participants access to execution machines that are similar to actual competition machines. The

<sup>7</sup> https://gitlab.com/sosy-lab/software/coveriteam


Table 4: Competition candidates with tool references and representing jury members; new indicates first-time participants, <sup>∅</sup> indicates hors-concours participation

competition report of SV-COMP 2022 provides a description on reproducing individual results and on trouble-shooting (see Sect. 3, [12]).

### 5 Results and Discussion

This section presents the results of the competition experiments. The report should help in understanding the state of the art and the advances in fully automatic test generation for whole C programs, in terms of effectiveness (test coverage, as accumulated in the score) and efficiency (resource consumption in terms of CPU time). All results mentioned in this article were inspected and approved by the participants.

Participating Test-Suite Generators. Table 4 provides an overview of the participating test generators and references to publications, as well as the team representatives of the jury of Test-Comp 2023. (The competition jury consists of the chair and one member of each participating team.) An online table with information about all participating systems is provided on the competition web site.<sup>8</sup> Table 5 lists the features and technologies that are used in the test generators.

There are test generators that did not actively participate (e.g., tester archives taken from last year) and that are not included in rankings. Those are called hors-concours participations and the tool names are labeled with a symbol (<sup>∅</sup>).

Computing Resources. The computing environment and the resource limits were the same as for Test-Comp 2020 [8], except for the upgraded operating system: Each test run was limited to 8 processing units (cores), 15 GB of memory, and 15 min of CPU time. The test-suite validation was limited to 2 processing units,

<sup>8</sup> https://test-comp.sosy-lab.org/2023/systems.php


Table 5: Technologies and features that the test generators used

7 GB of memory, and 5 min of CPU time. The machines for running the experiments are part of a compute cluster that consists of 168 machines; each test-generation run was executed on an otherwise completely unloaded, dedicated machine, in order to achieve precise measurements. Each machine had one Intel Xeon E3- 1230 v5 CPU, with 8 processing units each, a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system (x86\_64-linux, Ubuntu 22.04 with Linux kernel 5.15). We used BenchExec [24] to measure and control computing resources (CPU time, memory, CPU energy) and VerifierCloud<sup>9</sup> to distribute, install, run, and clean-up test-case generation runs, and to collect the results. The values for time and energy are accumulated over all cores of the CPU. To measure the CPU energy, we use CPU Energy Meter [25] (integrated in BenchExec [24]). Further technical parameters of the competition machines are available in the repository which also contains the benchmark definitions. <sup>10</sup>

<sup>9</sup> https://vcloud.sosy-lab.org

<sup>10</sup> https://gitlab.com/sosy-lab/test-comp/bench-defs/tree/testcomp22


Table 6: Quantitative overview over all results; empty cells mark opt-outs; new indicates first-time participants, <sup>∅</sup> indicates hors-concours participation

One complete test-generation execution of the competition consisted of 50 445 single test-generation runs in 25 run sets (tester × property). The total CPU time was 315 days and the consumed energy 89.9 kWh for one complete competition run for test generation (without validation). Test-suite validation consisted of 53 378 single test-suite validation runs in 26 run sets (validator × property). The total consumed CPU time was 19 days. Each tool was executed several times, in order to make sure that no installation issues occurred during the execution. Including preruns, the infrastructure managed a total of 254 445 test-generation runs (consuming 3.0 years of CPU time). The prerun test-suite validation consisted of 338 710 single test-suite validation runs in 152 run sets (validator × property), consuming 63 days of CPU time. The CPU energy was not measured during preruns.

New Test-Suite Generators. To acknowledge the test-suite generators that joined Test-Comp recently: ESBMC-kind, FuSeBMC\_IA, and WASP-C participated for the first time in Test-Comp 2023, and Legion/SymCC participated for the first time in Test-Comp 2022. Table 8 also reports the number of subcategories in which these tools participated.


Table 7: Overview of the top-three test generators for each category (measurement values for CPU time and energy rounded to two significant digits)

Table 8: New test-suite generators in Test-Comp 2022 and Test-Comp 2023; column 'Sub-categories' gives the number of executed categories


Quantitative Results. The quantitative results are presented in the same way as last year: Table 6 presents the quantitative overview of all tools and all categories. The head row gives the category and the number of test-generation tasks in that category. The tools are listed in alphabetical order; every table row lists the scores of one test generator. We indicate the top-three candidates by formatting their scores in bold face and in a larger font size. An empty table cell means that the test generator opted out of the respective main category (perhaps participating in subcategories only, restricting the evaluation to a specific topic). More information (including interactive tables, quantile plots for every category, and the raw data in XML format) is available on the competition web site<sup>11</sup> and in the results artifact (see Table 3). Table 7 reports the top-three test generators for each category. The consumed run time (column 'CPU Time') is given in hours and the consumed energy (column 'Energy') in kWh.

<sup>11</sup> https://test-comp.sosy-lab.org/2023/results

Fig. 4: Number of evaluated test generators for each year (top: number of first-time participants; bottom: previous year's participants)

Fig. 5: Quantile functions for category Overall. Each quantile function illustrates the quantile (x-coordinate) of the scores obtained by test-generation runs below a certain number of test-generation tasks (y-coordinate). More details were given previously [9]. The graphs are decorated with symbols to make them better distinguishable without color.

Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [24] because these visualizations make it easier to understand the results of the comparative evaluation. The web site<sup>11</sup> and the results artifact (Table 3) include such a plot for each category; as an example, we show the plot for category Overall (all test-generation tasks) in Fig. 5. We had 11 test generators participating in category Overall, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [6]). A more detailed discussion of score-based quantile plots for testing is provided in the Test-Comp 2019 competition report [9].
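As an illustration of how such a plot is derived, the following minimal sketch (our own simplification, not the BenchExec implementation; all names are hypothetical) computes the quantile points from per-run results, where each run contributes a score and some measured quantity for the y-axis:

```python
def score_based_quantile(runs):
    """runs: list of (score, value) pairs, one per test-generation run.

    Returns the points (accumulated score, value) of the score-based
    quantile function: runs with positive score are sorted by their
    measured value, and scores are accumulated along the x-axis."""
    positive = sorted((value, score) for score, value in runs if score > 0)
    points, acc = [], 0.0
    for value, score in positive:
        acc += score
        points.append((acc, value))
    return points
```

Plotting the returned points, with the accumulated score on the x-axis, yields one quantile function per test generator.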

# 6 Conclusion

The Competition on Software Testing took place for the 5th time and provides an overview of fully automatic test-generation tools for C programs. A total of 13 test-suite generators were compared (see Fig. 4 for the participation numbers and Table 4 for the details). This off-site competition uses a benchmark infrastructure that makes the execution of the experiments fully automatic and reproducible. Transparency is ensured by making all components available in public repositories and by having a jury (consisting of members from each team) oversee the process. All test suites were validated by the test-suite validator TestCov [23] to measure their coverage. The results of the competition are presented at the 26th International Conference on Fundamental Approaches to Software Engineering at ETAPS 2023.

Data-Availability Statement. The test-generation tasks and results of the competition are published at Zenodo, as described in Table 3. All components and data that are necessary for reproducing the competition are available in public version repositories, as specified in Table 2. For easy access, the results are presented also online on the competition web site https://test-comp.sosy-lab.org/2023/results.

Funding Statement. This project was funded in part by the Deutsche Forschungsgemeinschaft (DFG) — 418257054 (Coop).

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# FuSeBMC IA: Interval Analysis and Methods for Test Case Generation (Competition Contribution)

Mohannad Aldughaim<sup>1,4</sup>(B), Kaled M. Alshmrany<sup>1,5</sup>, Mikhail R. Gadelha<sup>2</sup>, Rosiane de Freitas<sup>3</sup>, and Lucas C. Cordeiro<sup>1,3</sup>

> <sup>1</sup> University of Manchester, Manchester, UK
> <sup>2</sup> Igalia, A Coruña, Spain
> <sup>3</sup> Federal University of Amazonas, Manaus, Brazil
> <sup>4</sup> King Saud University, Riyadh, Saudi Arabia
> <sup>5</sup> Institute of Public Administration, Jeddah, Saudi Arabia
> mohannad.aldughaim@manchester.ac.uk

Abstract. The cooperative verification of Bounded Model Checking (BMC) and fuzzing has proved to be one of the most effective techniques for testing C programs. FuSeBMC is a test-generation tool that employs BMC and fuzzing to produce test cases. For Test-Comp 2023, we present FuSeBMC IA, which improves the test generator by using interval methods and abstract interpretation (via Frama-C) to strengthen our instrumentation and fuzzing. Here, an abstract interpretation engine instruments the program as follows. It analyzes different program branches, combines the conditions of each branch, and produces a Constraint Satisfaction Problem (CSP), which is solved using Constraint Programming (CP) by interval manipulation techniques called *Contractor Programming*. This process yields a set of invariants for each branch, which are introduced back into the program as constraints. Experimental results show reductions in CPU time (37%) and memory (13%), while retaining a high score.

Keywords: Automated Test-Case Generation · Bounded Model Checking · Fuzzing · Abstract Interpretation · Constraint Programming · Contractors.

# 1 Introduction

In Test-Comp 2022 [1], cooperative verification tools showed their strength by ranking best in each category. *FuSeBMC* [9,10] is a test-generation tool that employs cooperative verification using fuzzing and BMC. *FuSeBMC* starts with an analysis phase to instrument the Program Under Test (PUT); then, based on the results from BMC/AFL, it generates the initial seeds for the fuzzer. Finally, *FuSeBMC* keeps track of the goals covered and updates the seeds, while producing test cases using BMC, fuzzing, and the selective fuzzer. This year, we introduce abstract interpretation to *FuSeBMC* to improve test-case generation. In particular, we use interval methods to support our instrumentation and fuzzing by providing intervals that help reach (instrumented) goals faster. The selective fuzzer is a crucial component of *FuSeBMC*; it generates test cases for uncovered goals based on information obtained from the test cases produced by BMC and the fuzzer [9]. This work is based on our previous study, where CSP/CP contractor techniques are applied to prune the state-space search [12]. Our approach also uses Frama-C [4,8] to obtain variable intervals, further pruning the state-space exploration. Our original contributions are: (1) improved instrumentation that allows abstract interpretation to provide information about variable intervals; (2) interval methods that improve the fuzzing and produce higher-impact test cases by pruning the search-space exploration; (3) reduced resource usage (incl. memory and CPU time).

# 2 Interval Analysis and Methods for Test Case Generation

*FuSeBMC IA* improves the original *FuSeBMC* using interval analysis and methods [3]. Fig. 1 illustrates the *FuSeBMC IA* architecture. Our approach starts from the analysis phase of *FuSeBMC* [9,10]. It parses the statement conditions required to reach a goal to construct a Constraint Satisfaction Problem/Constraint Programming (CSP/CP) instance [5] with three components: constraints (program conditions), variables (used in a condition), and domains (provided by the static analyzer Frama-C via the Eva plugin [7]). We instrument the PUT with Frama-C intrinsic functions to obtain the domains, i.e., the intervals of a given set of variables at a specific program location. Then, we apply the contractor to each goal's CSP and output the results to a file used by the selective fuzzer. Contractor Programming is a set of interval methods that estimate the solution

Fig. 1: *FuSeBMC IA*'s architecture. The changes introduced in *FuSeBMC IA* for Test-Comp 2023 are highlighted in green. The new Interval Analysis & Methods component generates intervals to be used by the selective fuzzer.

of a given CSP [5]. The contractor technique used is the forward-backward contractor, which is applied to a CSP/CP with a single constraint [3] and is implemented in the IBEX library [6]. IBEX is a C++ library for constraint processing over real numbers that implements contractors. More details regarding contractors can be found in our work in progress [12].
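To illustrate the idea behind a forward-backward contractor, the following sketch (a simplification of our own, not the IBEX implementation; function names are hypothetical) contracts the domains of x and y under the single constraint x + y ∈ c:

```python
def intersect(a, b):
    """Intersection of two closed intervals (lo, hi); None if empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def fwd_bwd_add(x, y, c):
    """Forward-backward contraction for the constraint x + y in c.

    Forward pass: evaluate x + y in interval arithmetic and intersect
    the result with c.  Backward pass: propagate the contracted sum
    back to the operands (x = s - y, y = s - x)."""
    s = intersect((x[0] + y[0], x[1] + y[1]), c)   # forward pass
    if s is None:
        return None, None                           # constraint unsatisfiable
    x2 = intersect(x, (s[0] - y[1], s[1] - y[0]))   # backward: x = s - y
    y2 = intersect(y, (s[0] - x[1], s[1] - x[0]))   # backward: y = s - x
    return x2, y2
```

For x, y = [0, 10] and the constraint x + y = 5, both domains contract to [0, 5]; an empty intersection (e.g., x, y = [0, 1] with x + y ∈ [5, 6]) corresponds to an unreachable goal.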

Parsing Conditions and CSP/CP Creation for Each Goal. While traversing the PUT's Clang AST [2], we consider each statement's conditions that lead to an injected goal: the conditions are parsed and converted from Clang expressions [2] to IBEX expressions [6]. The converted expressions are used as the constraints of a CSP/CP to create a contractor. After parsing the goals, we have a CSP/CP for each goal. If a goal does not have a CSP/CP, the intervals of its variables are left unchanged. In the case of multiple conditions, we create a constraint for each condition and take the intersection/union. At the end of this phase, we have a list of each goal and its contractor, as well as a list of variables for each contractor, which will be used to instrument the Frama-C file in the next phase.

Fig. 2: Example of the files produced (panels: instrumented file; instrumented file for Frama-C; intervals file). Starting from the instrumented file, which shows the injected goals, we instrument the file with the Frama-C intrinsic function. Finally, we produce a file listing each goal and the intervals that satisfy its conditions.

Domain reduction. In this step, we attempt to reduce the domains (initially (−∞, ∞)) to smaller ranges. This is done via the Frama-C Eva plugin (evolved value analysis) [7]. First, during the instrumentation, we create an instrumented file for Frama-C using its intrinsic function Frama\_C\_show\_each() (cf. Fig. 2). This function allows us to add custom text to identify goals and the number of variables in each call. Second, we run Frama-C to obtain the new variable intervals. Finally, we update the domains of the corresponding CSP/CP.

Applying contractors. Contractors help prune the domains of the variables by removing a subset of each domain that is guaranteed not to satisfy the constraints. With all the components of a CSP/CP available, we now apply the contractor for each goal and produce the output file shown in Fig. 2. The result is split per goal into two categories. The first category lists each variable and the possible intervals (lower bound followed by upper bound) to enter the given condition. The second category contains unreachable goals, i.e., goals for which the contractor result is an empty interval vector.

Selective Fuzzer. The selective fuzzer parses the file produced by the analyzer, extracts all the intervals, applies these intervals to each goal, and starts fuzzing within the given intervals, thus pruning the search space from random intervals to informed intervals. The selective fuzzer also prioritizes goals with smaller intervals and assigns a low priority to goals reported as unreachable.
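The prioritization described above can be sketched as follows (an illustrative simplification of our own, not the tool's implementation; the function name and the seeds-per-goal parameter are hypothetical, and unreachable goals are deprioritized here by receiving no initial seeds):

```python
import random

def seeds_from_intervals(goal_intervals, per_goal=3, rng=None):
    """goal_intervals: {goal: (lo, hi)} from the contractor output,
    or None for goals the contractor reported as unreachable.

    Reachable goals with tighter intervals are seeded first, so the
    fuzzer spends its budget where the search space is smallest."""
    rng = rng or random.Random(0)
    reachable = [(g, iv) for g, iv in goal_intervals.items() if iv is not None]
    reachable.sort(key=lambda p: p[1][1] - p[1][0])   # smaller interval first
    unreachable = [g for g, iv in goal_intervals.items() if iv is None]
    seeds = {}
    for goal, (lo, hi) in reachable:
        # Fuzz inside the contracted interval instead of a random range.
        seeds[goal] = [rng.uniform(lo, hi) for _ in range(per_goal)]
    for goal in unreachable:
        seeds[goal] = []                               # lowest priority
    return seeds
```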

# 3 Strengths and Weaknesses

Using abstract interpretation in *FuSeBMC IA* improved the resource usage of test-case generation. The new contractors generated by the Interval Analysis and Methods component are used by our selective fuzzer: (1) the information provided helps the selective fuzzer to start from a given range of values rather than a random range (as was our strategy in the previous version); (2) the selective fuzzer uses the information about unreachable goals to set their priority low; (3) compared to *FuSeBMC* v4, this improvement saved 37% of CPU time and 13% of memory, which leads to saving 40% of energy; (4) although our approach produces fewer test cases for a given category, the impact of these test cases is higher in terms of reaching instrumented goals; (5) there is potential for future work to use the information provided by Frama-C, especially regarding overflow warnings. Finally, in the worst case the intervals provided may not affect *FuSeBMC IA*'s outcome, i.e., the selective fuzzer performs no better than without interval information for seed generation. Since the time it takes to generate the intervals is only a tiny fraction of the time it takes to produce the test cases, the impact when the information is not useful is negligible.

Our approach suffers from a significant technical limitation: *FuSeBMC IA* cannot create complementary contractors; we can only create intervals that satisfy the constraints of a branch (i.e., outer contractors). In practice, we can only create intervals for if-statements and must ignore their else-branches (the inner contractor). We also skip any if-statement inside an else-branch, as this may lead to unsound intervals. This is a technical limitation rather than a theoretical one: we use run-time type information (RTTI) to identify IBEX expressions, but we link our tool with Clang, which requires compilation without RTTI. We are investigating approaches to address this limitation, e.g., encapsulating all IBEX expressions and manually storing expression information, but no proper fix has been implemented yet. Additionally, a bug was found that caused *FuSeBMC IA* to crash on some benchmarks, which made *FuSeBMC IA* score much lower than *FuSeBMC* in the coverage category.

### 4 Tool Setup and Configuration

When running *FuSeBMC IA*, the user is required to set the architecture with -a, the property file path with -p, and the benchmark path, as:

```
fusebmc.py [-a {32,64}] [-p PROPERTY_FILE]
           [-s {kinduction,falsi,incr,fixed}] [BENCHMARK_PATH]
```
For Test-Comp 2023, *FuSeBMC IA* uses incr for incremental BMC, which relies on ESBMC's symbolic execution engine [11]. The fusebmc.py and FuSeBMC.xml files are the BenchExec tool-info module and the benchmark definition file, respectively.

# 5 Software Project

*FuSeBMC IA* is publicly available on GitHub<sup>1</sup> under the terms of the MIT License. *FuSeBMC IA* is implemented in a combination of Python and C++. Build instructions and dependencies are available in the README.md file. *FuSeBMC IA* is a fork of the main *FuSeBMC* project, also available on GitHub<sup>2</sup>.

<sup>1</sup> https://github.com/Mohannad-Aldughaim/FuSeBMC IA

<sup>2</sup> https://github.com/kaled-alshmrany/FuSeBMC

# 6 Data-Availability Statement

All files necessary to run the tool are available on Zenodo [13].

# Acknowledgment

King Saud University, Saudi Arabia<sup>3</sup> supports the *FuSeBMC IA* development. The work in this paper is also partially funded by the UKRI/IAA project entitled "Using Artificial Intelligence/Machine Learning to assess source code in Escrow".

# References


<sup>3</sup> https://ksu.edu.sa/en/

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Author Index**

### **A**

Aguirre, Nazareno 3, 111
Aldughaim, Mohannad 324
Alshmrany, Kaled M. 324
Ansari, Saba Gholizadeh 151

### **B**

Baunach, Marcel 26
Bengolea, Valeria 111
Beyer, Dirk 309
Bianculli, Domenico 249
Bliudze, Simon 143
Brizzio, Matías 3
Burholt, Charlie 241

### **C**

Calinescu, Radu 241
Carvalho, Luiz 3
Cavalcanti, Ana 241
Chalupa, Marek 260
Cordeiro, Lucas C. 324
Cordy, Maxime 3

### **D**

d'Aloisio, Giordano 88
Dastani, Mehdi 151
Dawes, Joshua Heneage 249
de Freitas, Rosiane 324
Degiovanni, Renzo 3
Di Marco, Antinisca 88
Dignum, Frank 151
Din, Crystal Chang 220

### **E**

El-Hokayem, Antoine 173

### **F**

Falcone, Yliès 173
Figueiredo, Eduardo 282
Frias, Marcelo F. 111

### **G**

Gadelha, Mikhail R. 324
Gopinath, Divya 133

### **H**

Haltermann, Jan 195
Henzinger, Thomas A. 260
Huisman, Marieke 143

### **J**

Jakobs, Marie-Christine 195
Jones, Maddie 241

### **K**

Kamburjan, Eduard 220
Keller, Gabriele 151
Kifetew, Fitsum Meshesha 151

### **L**

Larsen, Kim Guldstrand 26
Lei, Stefanie Muroya 260
Li, Zhe 67
Lorber, Florian 26
Lungeanu, Luca 133

### **M**

Mangal, Ravi 133
Molina, Facundo 111
Muehlboeck, Fabian 260

### **N**

Neele, Thomas 47
Nyman, Ulrik 26


### **P**

Papadakis, Mike 3
Păsăreanu, Corina 133
Politano, Mariano 111
Ponzio, Pablo 111
Prandi, Davide 151
Prasetya, I. S. W. B. 151

### **R**

Ribeiro, Leandro Batista 26
Richter, Cedric 195
Rubbens, Robert 143

### **S**

Safina, Larisa 143
Sammartino, Matteo 47
Santana, Amanda 282
Santos, Geanderson 282
Shin, Donghwan 249
Soueidi, Chukri 173
Stilo, Giovanni 88

### **T**

Traon, Yves Le 3

### **V**

Vale, Gustavo 282
van den Bos, Petra 143

### **W**

Wehrheim, Heike 195

### **X**

Xie, Fei 67
Xie, Siqi 133

### **Y**

Yaman, Sinem Getir 241
Yu, Huanfeng 133